10 research outputs found

    Clinical microbiology with multi-view deep probabilistic models

    Clinical microbiology is one of the critical topics of this century. Identification and discrimination of microorganisms is considered a global public health threat by the main international health organisations, such as the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC). Rapid spread, high morbidity and mortality, and the economic burden associated with their treatment and control are the main causes of their impact. Discrimination of microorganisms is crucial for clinical applications: for instance, Clostridium difficile (C. diff) increases the mortality and morbidity of healthcare-related infections. Furthermore, in the past two decades, other bacteria, including Klebsiella pneumoniae (K. pneumoniae), have demonstrated a significant propensity to acquire antibiotic resistance mechanisms. Consequently, the use of an ineffective antibiotic may result in mortality. Machine Learning (ML) has the potential to be applied in clinical microbiology to automatise current methodologies and provide more efficient, guided, personalised treatments. However, microbiological data are challenging to exploit owing to the heterogeneous mix of data types involved, such as real-valued high-dimensional data, categorical indicators, multilabel epidemiological data, binary targets, and even time-series representations. This problem, known in ML as multi-view or multi-modal representation learning, has been studied in other application fields such as mental health monitoring and haematology. Multi-view learning combines different modalities or views representing the same data to extract richer insights and improve understanding. Each modality or view corresponds to a distinct encoding mechanism for the data, and this dissertation specifically addresses the issue of heterogeneity across multiple views.
In the probabilistic ML field, the exploitation of multi-view learning is also known as Bayesian Factor Analysis (FA). Current solutions face limitations when handling high-dimensional data and non-linear associations. Recent research proposes deep probabilistic methods to learn hierarchical representations of the data, which can capture intricate non-linear relationships between features. However, some Deep Learning (DL) techniques rely on complicated representations, which can hinder the interpretation of the outcomes. In addition, some inference methods used in DL approaches are computationally burdensome, which can hinder their practical application in real-world situations. Therefore, there is a demand for more interpretable, explainable, and computationally efficient techniques for high-dimensional data. By combining multiple views representing the same information, such as genomic, proteomic, and epidemiological data, multi-modal representation learning could provide a better understanding of the microbial world. Hence, this dissertation proposes two deep probabilistic models that address current limitations in the state of the art of clinical microbiology. Both models are also tested in two real scenarios, antibiotic resistance prediction in K. pneumoniae and automatic ribotyping of C. diff, in collaboration with the Instituto de Investigación Sanitaria Gregorio Marañón (IISGM) and the Instituto Ramón y Cajal de Investigación Sanitaria (IRyCIS). The first presented algorithm is the Kernelised Sparse Semi-supervised Heterogeneous Interbattery Bayesian Analysis (KSSHIBA). This algorithm uses a kernelised formulation to handle non-linear data relationships while providing compact representations through the automatic selection of relevance vectors. Additionally, it uses Automatic Relevance Determination (ARD) over the kernel to determine the relevance of each input feature.
The algorithm is then tailored to and applied in the microbiological laboratories of the IISGM and IRyCIS to predict antibiotic resistance in K. pneumoniae. To do so, specific kernels that handle Matrix-Assisted Laser Desorption Ionization (MALDI)-Time-Of-Flight (TOF) mass spectrometry of bacteria are used. Moreover, by exploiting multi-modal learning between the spectra and epidemiological information, it outperforms other state-of-the-art algorithms. The presented results demonstrate the importance of heterogeneous models that can analyse epidemiological information and can be adjusted automatically to different data distributions. The implementation of this method in microbiological laboratories could significantly reduce the time required to obtain resistance results in 24-72 hours and, moreover, improve patient outcomes. The second algorithm is a hierarchical Variational AutoEncoder (VAE) for heterogeneous data using an explainable FA latent space, called FA-VAE. The FA-VAE model builds on the KSSHIBA approach for dealing with semi-supervised heterogeneous multi-view problems and further expands the range of data domains that can be handled. Depending on the VAE architecture, it can work with a wide range of data types, including multilabel, continuous, binary, categorical, and even image data, which makes the FA-VAE model a versatile and powerful solution for real-world data sets. Additionally, this model is adapted and used in the microbiological laboratory of the IISGM, resulting in an innovative technique for automatic ribotyping of C. diff using MALDI-TOF data. To the best of our knowledge, this is the first demonstration of using any kind of ML for C. diff ribotyping. Experiments have been conducted on strains from the Hospital General Universitario Gregorio Marañón (HGUGM) to evaluate the viability of the proposed approach.
The results have demonstrated high accuracy rates, with KSSHIBA even achieving perfect accuracy on the first data collection. These models have also been tested in a real-life outbreak scenario at the HGUGM, where FA-VAE successfully classified all outbreak samples. The presented results have not only shown high accuracy in predicting each strain's ribotype but also revealed an explainable latent space. Furthermore, traditional ribotyping methods, which rely on PCR, required 7 days, whereas FA-VAE predicted the same results on the same day. This improvement has significantly reduced the response time, supporting the decision to isolate patients with hyper-virulent ribotypes of C. diff on the same day of infection. The promising results, obtained in a real outbreak, have provided a solid foundation for further advancements in the field. This study has been a crucial stepping stone towards realising the full potential of MALDI-TOF for bacterial ribotyping and advancing our ability to tackle bacterial outbreaks. In conclusion, this doctoral thesis has contributed to the field of Bayesian FA by addressing its drawbacks in handling various data types through the creation of two novel models, KSSHIBA and FA-VAE. Additionally, a comprehensive analysis of the limitations of automating laboratory procedures in the microbiology field has been carried out. The effectiveness of the newly developed models has been demonstrated through their successful implementation in critical problems, such as predicting antibiotic resistance and automating ribotyping. As a result, KSSHIBA and FA-VAE, in terms of both their technical and practical contributions, signify noteworthy progress in both the clinical and the Bayesian statistics fields. This dissertation opens up possibilities for future advancements in automating microbiological laboratories.
Doctoral Programme in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. President: Juan José Murillo Fuentes. Secretary: Jerónimo Arenas García. Member: María de las Mercedes Marín Arriaz
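The abstract above mentions Automatic Relevance Determination (ARD) over a kernel as the way input-feature relevance is determined. The following is only a minimal sketch of that general idea, not the KSSHIBA inference procedure: it assumes a standard RBF kernel with one non-negative relevance weight per feature, and the weights are fixed by hand here, whereas in an ARD setting they would be learned. The function names `ard_rbf_kernel` and `kernel_matrix` are hypothetical.

```python
import math


def ard_rbf_kernel(x, y, relevances):
    """RBF kernel with one non-negative relevance weight per input feature.

    A weight of 0 removes that feature from the similarity computation,
    which is how ARD expresses that a feature is irrelevant.
    """
    if not (len(x) == len(y) == len(relevances)):
        raise ValueError("x, y and relevances must have the same length")
    weighted_sq_dist = sum(g * (a - b) ** 2 for a, b, g in zip(x, y, relevances))
    return math.exp(-weighted_sq_dist)


def kernel_matrix(samples, relevances):
    """Pairwise kernel matrix for a list of samples."""
    return [[ard_rbf_kernel(a, b, relevances) for b in samples] for a in samples]


# Toy data (assumed): samples 0 and 1 differ only in the second feature.
samples = [[0.0, 1.0], [0.0, 5.0], [2.0, 1.0]]
relevances = [1.0, 0.0]  # hand-set: the second feature is switched off
K = kernel_matrix(samples, relevances)
```

With the second relevance set to zero, samples 0 and 1 become indistinguishable to the kernel, while sample 2 is still separated by the first feature.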

    Deep transfer learning for drug response prediction

    The goal of precision oncology is to make accurate predictions for cancer patients from the omics data of individual patients. Major challenges for computational methods of drug response prediction are that labeled clinical data are very limited, not publicly available, or cover drug response for only one or two drugs. These challenges have been addressed by generating large-scale pre-clinical datasets such as cancer cell lines or patient-derived xenografts (PDX). These pre-clinical datasets provide multi-omics characterization of samples and are often screened with hundreds of drugs, which makes them viable resources for precision oncology. However, they raise new questions: How can we integrate different data types? How can we handle the discrepancy between pre-clinical and clinical datasets that exists due to basic biological differences? And how can we make the best use of unlabeled samples in drug response prediction, where labeling is especially challenging? In this thesis, we propose methods based on deep neural networks to answer these questions. First, we propose a method of multi-omics integration. Second, we propose a transfer learning method to address the data discrepancy between cell lines, patients, and PDX models in the input and output space. Finally, we propose a semi-supervised method of out-of-distribution generalization to predict drug response using labeled and unlabeled samples. The proposed methods show promising performance compared to the state of the art and may guide precision oncology more accurately.

    Variational autoencoders for tissue heterogeneity exploration from (almost) no preprocessed mass spectrometry imaging data

    The paper presents the application of Variational Autoencoders (VAEs) for dimensionality reduction and exploratory analysis of mass spectrometry imaging (MSI) data. The results confirm that VAEs are capable of detecting the patterns associated with the different tissue sub-types with performance than standard approaches.
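As background for the VAE-based dimensionality reduction mentioned above, the sketch below shows two standard building blocks of any Gaussian VAE: the reparameterization trick used to sample a latent code, and the closed-form KL divergence between the encoder's diagonal Gaussian and a standard normal prior. It is a generic illustration, not the model from the paper; the encoder outputs (`mu`, `log_var`) are made-up numbers standing in for what a trained encoder would produce for one spectrum.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.3, -1.2]       # assumed encoder mean for a 2-D latent space
log_var = [0.0, -0.5]  # assumed encoder log-variance
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

The low-dimensional code `z` (or simply `mu`) is what would be plotted or clustered when exploring tissue heterogeneity; the KL term is the regulariser added to the reconstruction loss during training.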

    Coarse-grained modeling for molecular discovery: Applications to cardiolipin-selectivity

    The development of novel materials is pivotal for addressing global challenges such as achieving sustainability, technological progress, and advancements in medical technology. Traditionally, developing or designing new molecules was a resource-intensive endeavor, often reliant on serendipity. Given the vast space of chemically feasible drug-like molecules, estimated at between 10^6 and 10^100 compounds, traditional in vitro techniques fall short. Consequently, in silico tools such as virtual screening and molecular modeling have gained increasing recognition. However, the computational cost and the limited precision of the utilized molecular models still limit computational molecular design. This thesis aimed to enhance the molecular design process by integrating multiscale modeling and free energy calculations. Employing a coarse-grained model allowed us to efficiently traverse a significant portion of chemical space and reduce the sampling time required by molecular dynamics simulations. The physics-informed nature of the applied Martini force field and its level of retained structural detail make the model a suitable starting point for the focused learning of molecular properties. We applied our proposed approach to a cardiolipin bilayer, posing a relevant and challenging problem and facilitating reasonable comparison to experimental measurements. We identified promising molecules with defined properties within the resolution limit of a coarse-grained representation. Furthermore, we were able to bridge the gap from in silico predictions to in vitro and in vivo experiments, supporting the validity of the theoretical concept. The findings underscore the potential of multiscale modeling and free-energy calculations in enhancing molecular discovery and design and offer a promising direction for future research.

    The data concept behind the data: From metadata models and labelling schemes towards a generic spectral library

    Spectral libraries play a major role in imaging spectroscopy. They are commonly used to store end-member and spectrally pure material spectra, which are primarily used for mapping or unmixing purposes. However, the development of spectral libraries is time-consuming and usually sensor and site dependent. Spectral libraries are therefore often developed, used, and tailored only for a specific case study and only for one sensor. Multi-sensor and multi-site use of spectral libraries is difficult and requires technical effort for adaptation, transformation, and data harmonization steps. In particular, the huge number of urban material specifications and their spectral variations hampers the setup of a complete spectral library consisting of all available urban material spectra. A combined use of different urban spectral libraries would improve the coverage of spectral inter- and intra-class variability and would allow missing material spectra to be taken into account for multi-sensor and multi-site use. Publicly available spectral libraries mostly lack the metadata that are essential for describing spectra acquisition and sampling background, and that can serve to some extent as a measure of the quality and reliability of the spectra and of the library itself. In the GenLib project, a concept for a generic, multi-site and multi-sensor usable spectral library for image spectra with an urban focus was developed. This presentation will introduce 1) a unified, easy-to-understand hierarchical labelling scheme combined with 2) a comprehensive metadata concept that is 3) implemented in the SPECCHIO spectral information system to promote the setup and usability of a generic urban spectral library (GUSL). The labelling scheme was developed to ensure the translation of individual spectral libraries, with their own labelling schemes and usually varying levels of detail, into the GUSL framework.
It is based on a modified version of the EAGLE classification concept, combining land use, land cover, land characteristics, and spectral characteristics. The metadata concept consists of 59 mandatory and optional attributes that are intended to specify the spatial context, spectral library information, references, accessibility, calibration, preprocessing steps, and spectra-specific information describing the library spectra implemented in the GUSL. It was developed on the basis of existing metadata concepts and was the subject of an expert survey. The metadata concept and the labelling scheme are implemented in the spectral information system SPECCHIO, which is used for sharing and holding GUSL spectra. It allows easy implementation of spectra, as well as their specification with the proposed metadata, to extend the GUSL. The proposed data model therefore represents a first fundamental step towards a generically usable and continuously expandable spectral library for urban areas. The metadata concept and the labelling scheme also form the basis for the necessary adaptation and transformation steps of the GUSL, so that it can be used entirely or in excerpts for further multi-site and multi-sensor applications.

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images might be affected by different sources of perturbation (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has therefore attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. This operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, and it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
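The evaluation described above perturbs test images with different levels of Gaussian and uniform noise. The sketch below illustrates how such perturbed test sets could be generated; it does not implement the CORF operator or the authors' protocol. The noise levels, the [0, 1] pixel range, and the clipping step are assumptions, and the helper names are hypothetical. A 28x28 grayscale array is used because that is the Fashion MNIST image size.

```python
import random


def _clip(value, low=0.0, high=1.0):
    """Keep a pixel intensity inside the valid range."""
    return max(low, min(high, value))


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each pixel."""
    return [[_clip(p + rng.gauss(0.0, sigma)) for p in row] for row in image]


def add_uniform_noise(image, half_width, rng):
    """Add noise drawn uniformly from [-half_width, half_width] to each pixel."""
    return [[_clip(p + rng.uniform(-half_width, half_width)) for p in row] for row in image]


rng = random.Random(0)
clean = [[0.5] * 28 for _ in range(28)]  # placeholder image, not real data
levels = (0.05, 0.1, 0.2)                # assumed noise levels

noisy_sets = {("gaussian", s): add_gaussian_noise(clean, s, rng) for s in levels}
noisy_sets.update({("uniform", w): add_uniform_noise(clean, w, rng) for w in levels})
```

A classifier trained only on clean images would then be scored on each entry of `noisy_sets` to compare how accuracy degrades with and without the preprocessing step.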

    Applications

    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine, for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production, and milling, for quality control during manufacturing processes; in traffic and logistics, for smart cities; and in mobile communications.
