11 research outputs found

    Integrated Multi-omics Analysis Using Variational Autoencoders: Application to Pan-cancer Classification

    Different aspects of a clinical sample can be revealed by multiple types of omics data. Integrated analysis of multi-omics data provides a comprehensive view of patients, which has the potential to facilitate more accurate clinical decision making. However, omics data are normally high dimensional, with a large number of molecular features and a relatively small number of available samples with clinical labels. This "curse of dimensionality" makes it challenging to train a machine learning model on high-dimensional omics data such as DNA methylation and gene expression profiles. Here we propose an end-to-end deep learning model called OmiVAE to extract low-dimensional features and classify samples from multi-omics data. OmiVAE combines the basic structure of variational autoencoders with a classification network to achieve task-oriented feature extraction and multi-class classification. The training procedure of OmiVAE comprises an unsupervised phase without the classifier and a supervised phase with the classifier. During the unsupervised phase, a hierarchical cluster structure of samples forms automatically without the need for labels. In the supervised phase, OmiVAE achieved an average classification accuracy of 97.49% after 10-fold cross-validation among 33 tumour types and normal samples, outperforming other existing methods. The OmiVAE model learned from multi-omics data outperformed the model using only one type of omics data, which indicates that the complementary information from different omics data types provides useful insights for biomedical tasks like cancer classification. Comment: 7 pages, 4 figures
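
The abstract does not spell out the OmiVAE objective. As a rough sketch, assuming the standard VAE loss (reconstruction error plus KL divergence to a unit Gaussian) with a cross-entropy term added only in the supervised phase, the two-phase training could be expressed as below. The function names, the weight `alpha`, and the squared-error reconstruction term are illustrative assumptions, not details taken from the paper.

```python
import math

def kl_to_unit_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single sample."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample, computed stably."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[label]

def two_phase_loss(x, x_recon, mu, logvar, logits=None, label=None, alpha=1.0):
    """Hypothetical combined loss (assumption, not the paper's exact objective).

    Unsupervised phase: call without logits/label (reconstruction + KL only).
    Supervised phase: pass logits and label to add the classification term.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    loss = recon + kl_to_unit_gaussian(mu, logvar)
    if logits is not None and label is not None:
        loss += alpha * cross_entropy(logits, label)
    return loss
```

Calling `two_phase_loss` without `logits` and `label` corresponds to the unsupervised phase; supplying them switches on the classifier term, mirroring the two phases described in the abstract.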

    Blood biomarker-based classification study for neurodegenerative diseases

    © 2023, Springer Nature Limited. As the population ages, neurodegenerative diseases are becoming more prevalent, making it crucial to comprehend the underlying disease mechanisms and identify biomarkers to allow for early diagnosis and effective screening for clinical trials. Thanks to advancements in gene expression profiling, it is now possible to search for disease biomarkers on an unprecedented scale. Here we applied a selection of five machine learning (ML) approaches, together with multiple feature selection methods, to identify blood-based biomarkers for Alzheimer's disease (AD) and Parkinson's disease (PD). Based on ROC AUC performance, one optimal random forest (RF) model was discovered for AD with 159 gene markers (ROC AUC = 0.886), while one optimal RF model was discovered for PD (ROC AUC = 0.743). Additionally, deep learning approaches were applied alongside the traditional ML approaches to evaluate their potential for future work. We demonstrated that convolutional neural networks perform consistently well across both the Alzheimer's (ROC AUC = 0.810) and Parkinson's (ROC AUC = 0.715) datasets, suggesting their potential in gene expression biomarker detection with further tuning of their architecture.
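
ROC AUC, the metric reported throughout this abstract, equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that pairwise definition follows; the labels and scores are made-up toy values, not data from the study.

```python
def roc_auc(labels, scores):
    """ROC AUC via the pairwise (rank) definition; labels are 0 or 1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical values): 3 of the 4 positive/negative pairs
# are ordered correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

Under this definition, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the 0.886 and 0.743 figures above.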

    An Enhancement to CNN Approach with Synthesized Image Data for Disease Subtype Classification

    The introduction of genetic testing has profoundly enhanced the prospects for early detection of diseases and for techniques that suggest precision medicines. The subtyping of critical diseases has proven to be an essential part of the development of individualized therapies and has led to deeper insights into the heterogeneity of the disease. Studies suggest that variants in particular genes have significant effects on certain types of immune system cells and are also involved in the risk of certain critical illnesses like cancer. By analyzing the genetic sequence of a patient, disease types and subtypes can be predicted. Recent research has shown that a CNN's prediction quality in this context, using gene intensity features, can be improved when the input is structured into 2D images. Constructed from chromosome locations or from transformations involving kPCA, t-SNE, etc., these two-dimensional images express certain types of relationships among the intensity features. While this approach extends the success of convolutional neural networks to non-image data, obtaining a precise mapping of features onto the images that reflects the relationships among the features is hard, if not impossible. To this end, we propose an enhancement to the approach that provides the CNN training procedure with not only the samples of the structured image data but also the samples of the unstructured raw gene expression data in its original form. While the former is fed into the convolutional layers of the network, the latter is input only to the fully connected layers. The proposed method is applied to The Cancer Genome Atlas (TCGA) dataset for cancer subtypes with the median values of the expression level of all expressed genes in an RNA sequence.
According to the experiments, our proposed approach improves classification accuracy by 2.7% when it is applied to the state-of-the-art method with a 2D CNN architecture trained on images constructed from the chromosome locations of the genes. When built on top of the method with a 2D CNN architecture trained on images constructed with a transformation process involving t-SNE, classification accuracy is enhanced by 4.7%. When the proposed approach is implemented on the 1D CNN model using data structured by covariance between the features, classification accuracy improves by 1%, and an increase of 3% is observed when the approach is implemented on the 1D CNN model trained with data ordered by chromosome location.

    Algorithms for Inferring Multiple Microbial Networks

    The interactions among the constituent members of a microbial community play a major role in determining the overall behavior of the community and the abundance levels of its members. These interactions can be modeled using a microbial network: a weighted graph whose nodes represent microbial taxa and whose edges represent pairwise associations among these taxa, used to model co-occurrences and/or interactions of the community members. A microbial network is typically constructed from a sample-taxa count matrix that is obtained by sequencing multiple biological samples and identifying taxa counts. From large-scale microbiome studies, it is evident that microbial community compositions and interactions are impacted by environmental and/or host factors. Thus, it is not unreasonable to expect that a sample-taxa matrix generated as part of a large study involving multiple environmental or clinical parameters can be associated with more than one microbial network. However, to our knowledge, microbial network inference methods proposed thus far assume that the sample-taxa matrix is associated with a single network. This dissertation addresses the scenario in which the sample-taxa matrix is associated with K microbial networks and considers the computational problem of inferring K microbial networks from a given sample-taxa matrix. The contributions of this dissertation include 1) new frameworks to generate synthetic sample-taxa count data; 2) novel methods that combine mixture modeling with probabilistic graphical models to infer multiple interaction/association networks from microbial count data; 3) methods for dealing with the compositionality of microbial count data; 4) extensive experiments on real and synthetic data; and 5) new methods for model selection to infer the correct value of K.

    Creation of a system for applying convolutional neural networks in a machine vision environment

    The work consists of building an application that integrates the Python programming language and all its functionality into a C++ environment for real-time image capture and processing from a camera used in machine vision production settings. Specifically, the application will connect to an industrial camera, configure it, and capture pictures that serve as the basis for applying deep learning to those images. Then, through the integration of Python in C++, convolutional neural networks will be applied to the image obtained by the application in order to produce a result for each image (classification). Master's Degree in Computer Engineering (Máster Universitario en Ingeniería Informática), Universidad Pública de Navarra.

    A computational framework for the comparative analysis of glioma models and patients

    Adult-type diffuse gliomas are aggressive, incurable brain cancers. Humanized mouse models help to understand molecular mechanisms and identify therapeutic targets, but comparing them with patient samples is difficult. I developed a computational framework, CAPE, for comparing tumor models and patient expression profiles using non-negative matrix factorization.
Applying CAPE to humanized mouse glioma subtype avatar (GSA) models and adult-type diffuse glioma patients revealed a strong resemblance between the models and the proneural glioblastoma subtype. CAPE showed that transplantation improved the acquisition of new tumor states in the models. Combining reporter-based genetic tracing with CAPE showed that a subset of in vivo GSA populations clusters with patients having astrocytic-like identities. In vitro treatment of GSA models with human serum, TNFα, or ionizing radiation led to a shift toward the mesenchymal state upon reporter selection. Single-cell transcriptomics annotated GSA populations under different conditions, revealing all glioblastoma states in vivo and upon activation by external factors. Comparing GSA single-cell populations with patients confirmed these identities. The study established a comprehensive framework for testing and validating improvements to tumor models so that they better resemble patients, and advanced the understanding of tumor biology and treatment response.

    Mass spectral imaging of clinical samples using deep learning

    A better interpretation of tumour heterogeneity and variability is vital for the improvement of novel diagnostic techniques and personalized cancer treatments. Tumour tissue heterogeneity is characterized by biochemical heterogeneity, which can be investigated by unsupervised metabolomics. Mass Spectrometry Imaging (MSI) combined with machine learning techniques has generated increasing interest as an analytical and diagnostic tool for the analysis of spatial molecular patterns in tissue samples. Considering the high complexity of the data produced by MSI, which can consist of many thousands of spectral peaks, statistical analysis, and in particular machine learning and deep learning, has been investigated as a novel approach to deduce the relationships between the measured molecular patterns and the local structural and biological properties of the tissues. Machine learning has historically been divided into two main categories: supervised and unsupervised learning. In MSI, supervised learning methods may be used to segment tissues into histologically relevant areas, e.g. the classification of tissue regions in H&E (Haematoxylin and Eosin) stained samples. Initial classification by an expert histopathologist through visual inspection enables the development of univariate or multivariate models based on tissue regions that have significantly up- or down-regulated ions. However, complex data may result in underdetermined models, and alternative methods that can cope with high dimensionality and noisy data are required. Here, we describe, apply, and test a novel diagnostic procedure built using a combination of MSI and deep learning with the objective of delineating and identifying biochemical differences between cancerous and non-cancerous tissue in metastatic liver cancer and epithelial ovarian cancer.
The workflow investigates the robustness of single (1D) to multidimensional (3D) tumour analyses and also highlights possible biomarkers which are not accessible from classical visual analysis of the H&E images. The identification of key molecular markers may provide a deeper understanding of tumour heterogeneity and potential targets for intervention. Open Access

    Similarity between cell lines in pharmacological recommender systems for oncological treatment

    In recent decades, healthcare has focused on finding increasingly personalized answers for treating a wide range of pathologies. Cancer patients are part of this trend, set apart by the complexity of their disease. New disciplines have emerged in response, including bioinformatics, pharmacogenomics, machine learning, data mining and genomics. Genetic sequencing has driven significant advances in these areas, increasingly enabling so-called precision medicine tailored to each patient. In other words, the patient is increasingly treated as an individual with a given pathology, rather than as one of a group of patients with distinct characteristics who share the same pathology. This work studies similarity between cell lines, based on Recommender Systems (RecSys), for the treatment of cancer patients. The project follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, covering similarity metrics and machine learning algorithms in order to identify similarity between cell lines. The dataset used was Genomics of Drug Sensitivity in Cancer (GDSC1), from which a sample of 20 cell lines was selected (10 from breast pathologies and 10 from skin pathologies), each with 49386 genes, given the available hardware resources. To assess the similarity of gene expression between these cell lines, similarity metrics are applied to 3 genes from a sample of the 20 cell lines, while the machine learning algorithms are evaluated on all 49386 genes of each of the 20 cell lines. The similarity metrics tested were the Dice, Jaccard, Sorensen, Czekanowski, Minkowski, Pearson, Intersection, Manhattan, Tanimoto and Euclidean distances.
The machine learning algorithms tested were: Artificial Neural Network, Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors, DecisionTreeClassifier, Gaussian NB and Support Vector Machine. In conclusion, the similarity distances with the best results were Jaccard and Dice, since they gave the most consistent results for the two selected genes (more so for one of the two genes), while the algorithms with the best accuracy were Logistic Regression, Linear Discriminant Analysis and Gaussian NB.
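
For reference, the Jaccard and Dice measures singled out above can be sketched on sets as follows. The abstract does not say how gene expression values were converted into a form these measures accept, so the set representation (for example, genes counted as "expressed" in a cell line) and the gene names below are purely hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dice(a, b):
    """Dice similarity: 2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0

# Hypothetical sets of expressed genes for two cell lines (illustrative only).
line_a = {"GENE1", "GENE2", "GENE3"}
line_b = {"GENE2", "GENE3", "GENE4"}

j = jaccard(line_a, line_b)   # 2 shared genes out of 4 distinct genes
d = dice(line_a, line_b)      # 2 * 2 shared genes over 3 + 3 genes
print(j, d)
```

The two measures are monotonically related (Dice = 2J / (1 + J)), so they always rank pairs of cell lines in the same order, which is consistent with the two performing similarly in the results reported above.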