
    PANTHER: Pathway Augmented Nonnegative Tensor factorization for HighER-order feature learning

    Genetic pathways usually encode molecular mechanisms that can inform targeted interventions. It is often challenging for existing machine learning approaches to jointly model genetic pathways (higher-order features) and variants (atomic features), and to present interpretable models to clinicians. In order to build more accurate and more interpretable machine learning models for genetic medicine, we introduce Pathway Augmented Nonnegative Tensor factorization for HighER-order feature learning (PANTHER). PANTHER selects informative genetic pathways that directly encode molecular mechanisms. We apply genetically motivated constrained tensor factorization to group pathways in a way that reflects molecular mechanism interactions. We then train a softmax classifier for disease types using the identified pathway groups. We evaluated PANTHER against multiple state-of-the-art constrained tensor/matrix factorization models, as well as group-guided and Bayesian hierarchical models. PANTHER significantly outperforms all state-of-the-art comparison models (p<0.05). Our experiments on large-scale Next Generation Sequencing (NGS) and whole-genome genotyping datasets also demonstrate the wide applicability of PANTHER. We performed feature analysis for disease type prediction, which suggested insights into and benefits of the identified pathway groups. Comment: Accepted by the 35th AAAI Conference on Artificial Intelligence (AAAI 2021).
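
    The pipeline sketched in this abstract (pathway-guided non-negative tensor factorization followed by a softmax classifier) can be approximated with off-the-shelf tools. The sketch below is a minimal, hedged illustration that uses TensorLy's generic non-negative CP decomposition in place of PANTHER's genetically constrained factorization; the tensor shape, rank, and all variable names are illustrative assumptions, not the authors' implementation.

        # Minimal sketch: non-negative CP factorization of a hypothetical
        # (patients x pathways x variants) tensor, then a softmax classifier
        # on the patient-mode factors. Generic stand-in for PANTHER's
        # pathway-constrained factorization, not the paper's code.
        import numpy as np
        import tensorly as tl
        from tensorly.decomposition import non_negative_parafac
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = tl.tensor(rng.random((200, 50, 30)))   # 200 patients, 50 pathways, 30 variants (made up)
        y = rng.integers(0, 3, size=200)           # three hypothetical disease types

        cp = non_negative_parafac(X, rank=10, n_iter_max=200)
        patient_factors = tl.to_numpy(cp.factors[0])   # patient-mode loadings as features

        clf = LogisticRegression(max_iter=1000)        # multinomial (softmax) classifier
        clf.fit(patient_factors, y)
        print("training accuracy:", clf.score(patient_factors, y))

    In the actual method the factorization is additionally constrained by pathway membership, which a generic CP decomposition like the one above does not enforce.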

    New approaches for unsupervised transcriptomic data analysis based on Dictionary learning

    The era of high-throughput data generation enables new access to biomolecular profiles and their exploitation. However, the analysis of such biomolecular data, for example transcriptomic data, suffers from the so-called "curse of dimensionality". This occurs in the analysis of datasets with a significantly larger number of variables than data points. As a consequence, overfitting and unintentional learning of process-independent patterns can appear, which can lead to results that are not meaningful in application. A common way of counteracting this problem is to apply dimension reduction methods and then analyse the resulting low-dimensional representation, which has a smaller number of variables. In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concepts of Dictionary learning, an unsupervised dimension reduction approach. Unlike many dimension reduction approaches widely applied in transcriptomic data analysis, Dictionary learning does not impose constraints on the components to be derived. This allows for great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of sparse methods is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Indeed, transcriptomic data are particularly structured, for example through the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis has so far been largely restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach; interpretability is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes. Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups for samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods that are widely applied in transcriptomic data analysis. Our methods achieve high performance and overall outperform the comparison methods.
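
    Dictionary learning itself is available in standard libraries. The following is a minimal sketch, assuming a (samples x genes) matrix, of how sparse codes from scikit-learn's DictionaryLearning could feed a downstream subgroup analysis; it is a generic illustration with synthetic data, not the thesis' algorithms for subgroup identification or pseudotime estimation.

        # Minimal sketch: dictionary learning as sparse dimension reduction
        # for a (samples x genes) expression matrix, followed by k-means on
        # the sparse codes to suggest sample subgroups. Synthetic data only.
        import numpy as np
        from sklearn.decomposition import DictionaryLearning
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        expression = rng.random((100, 2000))       # hypothetical: 100 samples, 2000 genes

        dl = DictionaryLearning(n_components=15, alpha=1.0, max_iter=500, random_state=0)
        codes = dl.fit_transform(expression)       # (100, 15) sparse representation

        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(codes)
        print("subgroup sizes:", np.bincount(labels))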

    Linking Functional Brain Networks To Psychopathology And Beyond

    Neurobiological abnormalities associated with neuropsychiatric disorders do not map well to existing diagnostic categories. High co-morbidity suggests dimensional circuit-level abnormalities that cross diagnoses. As neuropsychiatric disorders are increasingly reconceptualized as disorders of brain development, deviations from normative brain network reconfiguration during development are hypothesized to underlie many illnesses that arise in young adulthood. In this dissertation, we first applied recent advances in machine learning to a large imaging dataset of youth (n=999) to delineate brain-guided dimensions of psychopathology across clinical diagnostic boundaries. Specifically, using sparse Canonical Correlation Analysis, an unsupervised learning method that seeks to capture sources of variation common to two high-dimensional datasets, we discovered four linked dimensions of psychopathology and functional brain network connectivity: mood, psychosis, fear, and externalizing behavior. While each dimension exhibited a unique pattern of functional brain connectivity, loss of network segregation between the default mode and executive networks emerged as a connectopathy shared across all four dimensions of psychopathology. Building upon this work, in the second part of the dissertation, we designed, implemented, and deployed a new penalized statistical learning approach, Multi-Scale Network Regression (MSNR), to study brain network connectivity and a wide variety of phenotypes beyond psychopathology. MSNR explicitly respects both edge- and community-level information by assuming a low-rank and sparse structure, encouraging less complex and more interpretable modeling. Capitalizing on a large neuroimaging cohort (n=1,051), we demonstrated that MSNR recapitulated interpretable and statistically significant associations between functional connectivity patterns and brain development, sex differences, and motion-related artifacts. Compared to common single-scale approaches, MSNR achieved a balance between prediction performance and model complexity, with improved interpretability. Together, integrating recent advances across machine learning, network science, developmental neuroscience, and psychiatry, this body of work fits into the broader context of computational psychiatry, where there is intense interest in the quest to delineate brain network patterns associated with psychopathology, among a diverse range of phenotypes.
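
    Sparse CCA is not part of scikit-learn, but the underlying idea of linking symptom scores to connectivity features can be sketched with the plain (non-sparse) CCA implementation as a stand-in; all data, dimensions, and variable names below are synthetic assumptions, not the dissertation's analysis.

        # Minimal sketch: canonical correlation analysis linking item-level
        # symptom scores to vectorized functional-connectivity edges.
        # Plain CCA is used here as a non-sparse stand-in for sparse CCA.
        import numpy as np
        from sklearn.cross_decomposition import CCA

        rng = np.random.default_rng(0)
        n_subjects = 999
        symptoms = rng.random((n_subjects, 100))        # synthetic symptom items
        connectivity = rng.random((n_subjects, 1000))   # synthetic connectivity edges

        cca = CCA(n_components=4)                       # four linked dimensions
        sym_scores, conn_scores = cca.fit_transform(symptoms, connectivity)

        for k in range(4):
            r = np.corrcoef(sym_scores[:, k], conn_scores[:, k])[0, 1]
            print(f"dimension {k + 1}: canonical correlation = {r:.2f}")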

    Machine learning approaches to model cardiac shape in large-scale imaging studies

    Recent improvements in non-invasive imaging, together with the introduction of fully-automated segmentation algorithms and big data analytics, have paved the way for large-scale population-based imaging studies. These studies promise to increase our understanding of a large number of medical conditions, including cardiovascular diseases. However, analysis of cardiac shape in such studies is often limited to simple morphometric indices, ignoring a large part of the information available in medical images. Discovery of new biomarkers by machine learning has recently gained traction, but often lacks interpretability. The research presented in this thesis aimed at developing novel explainable machine learning and computational methods capable of better summarizing shape variability, in order to better inform association and predictive clinical models in large-scale imaging studies. A powerful and flexible framework to model the relationship between three-dimensional (3D) cardiac atlases, encoding multiple phenotypic traits, and genetic variables is first presented. The proposed approach enables the detection of regional phenotype-genotype associations that would otherwise be neglected by conventional association analysis. Three learning-based systems based on deep generative models are then proposed. In the first model, I propose a classifier of cardiac shapes which exploits task-specific generative shape features and is designed to enable the visualisation, in 3D, of the anatomical effect these features encode, making the classification task transparent. The second approach models a database of anatomical shapes via a hierarchy of conditional latent variables and is capable of detecting, quantifying, and visualising onto a template shape the most discriminative anatomical features that characterize distinct clinical conditions. Finally, a preliminary analysis of a deep learning system capable of reconstructing 3D high-resolution cardiac segmentations from a sparse set of 2D view segmentations is reported. This thesis demonstrates that machine learning approaches can facilitate high-throughput analysis of normal and pathological anatomy and of its determinants without losing clinical interpretability.
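
    The first generative classifier described above could, in spirit, be sketched as a small variational autoencoder over vectorized shape descriptors with a classification head on the latent code, so that discriminative latent directions can be decoded back into shape space for visualisation. The PyTorch sketch below is a simplified stand-in under these assumptions; the network sizes, input representation, and names are illustrative, not the thesis architecture.

        # Minimal sketch: VAE over vectorized shapes with a classifier head
        # on the latent mean, trained with reconstruction + KL + cross-entropy.
        import torch
        import torch.nn as nn

        class ShapeVAEClassifier(nn.Module):
            def __init__(self, n_points=1000, latent_dim=16, n_classes=2):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(n_points, 256), nn.ReLU())
                self.to_mu = nn.Linear(256, latent_dim)
                self.to_logvar = nn.Linear(256, latent_dim)
                self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                             nn.Linear(256, n_points))
                self.classifier = nn.Linear(latent_dim, n_classes)

            def forward(self, x):
                h = self.encoder(x)
                mu, logvar = self.to_mu(h), self.to_logvar(h)
                z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
                return self.decoder(z), self.classifier(mu), mu, logvar

        model = ShapeVAEClassifier()
        shapes = torch.randn(8, 1000)                  # hypothetical vectorized shapes
        labels = torch.randint(0, 2, (8,))
        recon, logits, mu, logvar = model(shapes)

        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = (nn.functional.mse_loss(recon, shapes) + kl
                + nn.functional.cross_entropy(logits, labels))
        loss.backward()
        print("total loss:", float(loss))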

    Right ventricular biomechanics in pulmonary hypertension

    As outcome in pulmonary hypertension is strongly associated with progressive right ventricular dysfunction, the work in this thesis seeks to determine the regional distribution of forces on the right ventricle, its geometry, and deformations subsequent to load. This thesis contributes to the understanding of how circulating biomarkers of energy metabolism and stress-response pathways are related to adverse cardiac remodelling and functional decompensation. A numerical model of the heart was used to derive a three-dimensional representation of right ventricular morphology, function and wall stress in pulmonary hypertension patients. This approach was tested by modelling the effect of pulmonary endarterectomy in patients with chronic thromboembolic disease. The relationship between the cardiac phenotype and 10 circulating metabolites, known to be associated with all-cause mortality, was assessed using mass univariate regression. Increasing afterload (mean pulmonary artery pressure) was significantly associated with hypertrophy of the right ventricular inlet and dilatation, indicative of global eccentric remodelling, and decreased systolic excursion. Right ventricular ejection fraction was found to be negatively associated with 3-hydroxy-3-methylglutarate, N-formylmethionine, and fumarate. Wall stress was related to all-cause mortality and its decrease after pulmonary endarterectomy was associated with a fall in brain natriuretic peptide. Six metabolites were associated with elevated end-systolic wall stress: dehydroepiandrosterone sulfate, N2,N2-dimethylguanosine, N1-methylinosine, 3-hydroxy-3-methylglutarate, N-acetylmethionine, and N-formylmethionine. Metabolic profiles related to energy metabolism and stress-response are associated with elevations in right ventricular end-systolic wall stress that have prognostic significance in pulmonary hypertension patients. These results show that statistical parametric mapping can give regional information on the right ventricle and that metabolic phenotyping, as well as predicting outcomes, provides markers informative of the biomechanical status of the right ventricle in pulmonary hypertension.
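
    The mass univariate regression mentioned above can be illustrated with a short statsmodels sketch: one ordinary least squares model per metabolite against a single imaging phenotype, with false-discovery-rate correction across tests. The data and phenotype below are synthetic placeholders, not study results.

        # Minimal sketch: mass univariate regression of a cardiac phenotype
        # (e.g. RV ejection fraction) on each circulating metabolite in turn,
        # with Benjamini-Hochberg correction. Synthetic data only.
        import numpy as np
        import statsmodels.api as sm
        from statsmodels.stats.multitest import multipletests

        rng = np.random.default_rng(0)
        n_patients, n_metabolites = 300, 10
        metabolites = rng.standard_normal((n_patients, n_metabolites))
        rvef = rng.standard_normal(n_patients)         # hypothetical phenotype

        pvals = []
        for j in range(n_metabolites):
            X = sm.add_constant(metabolites[:, j])     # intercept + one metabolite
            pvals.append(sm.OLS(rvef, X).fit().pvalues[1])

        reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
        print("significant metabolite indices:", np.where(reject)[0])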

    Multimodal Data Fusion and Quantitative Analysis for Medical Applications

    Medical big data is not only enormous in size, but also heterogeneous and complex in structure, which makes it difficult for conventional systems or algorithms to process. These heterogeneous medical data include imaging data (e.g., Positron Emission Tomography (PET), Computerized Tomography (CT), Magnetic Resonance Imaging (MRI)) and non-imaging data (e.g., laboratory biomarkers, electronic medical records, and hand-written doctor notes). Multimodal data fusion is an emerging and vital field that addresses this challenge, aiming to process and analyze such complex, diverse, and heterogeneous multimodal data. Fusion algorithms bring great potential to medical data analysis by 1) taking advantage of complementary information from different sources (such as the functional-structural complementarity of PET/CT images) and 2) exploiting consensus information that reflects the intrinsic essence (such as the genetic essence underlying medical imaging and clinical symptoms). Thus, multimodal data fusion benefits a wide range of quantitative medical applications, including personalized patient care, more optimal medical operation planning, and preventive public health. Although there has been extensive research on computational approaches for multimodal fusion, three major challenges remain in quantitative medical applications, summarized as feature-level fusion, information-level fusion, and knowledge-level fusion:
    • Feature-level fusion. The first challenge is to mine multimodal biomarkers from high-dimensional, small-sample multimodal medical datasets, which hinders the effective discovery of informative multimodal biomarkers. Specifically, efficient dimension reduction algorithms are required to alleviate the "curse of dimensionality" and to satisfy the criteria for discovering interpretable, relevant, non-redundant, and generalizable multimodal biomarkers.
    • Information-level fusion. The second challenge is to exploit and interpret inter-modal and intra-modal information for precise clinical decisions. Although radiomics and multi-branch deep learning have been used for implicit information fusion guided by label supervision, there is a lack of methods that explicitly explore inter-modal relationships in medical applications. Unsupervised multimodal learning is able to mine inter-modal relationships, reduce the reliance on labor-intensive annotation, and explore potentially undiscovered biomarkers; however, mining discriminative information without label supervision remains an open challenge. Furthermore, the interpretation of complex non-linear cross-modal associations, especially in deep multimodal learning, is another critical challenge in information-level fusion, which hinders the exploration of multimodal interactions in disease mechanisms.
    • Knowledge-level fusion. The third challenge is quantitative knowledge distillation from multi-focus regions in medical imaging. Although characterizing imaging features from single lesions using either feature engineering or deep learning has been investigated in recent years, both approaches neglect the importance of inter-region spatial relationships. Thus, a topological profiling tool for multi-focus regions is in high demand, yet it is missing in current feature engineering and deep learning methods. Furthermore, incorporating domain knowledge with the knowledge distilled from multi-focus regions is another challenge in knowledge-level fusion.
    To address these three challenges in multimodal data fusion, this thesis provides a multi-level fusion framework for multimodal biomarker mining, multimodal deep learning, and knowledge distillation from multi-focus regions. Specifically, our major contributions in this thesis include:
    • To address the challenges in feature-level fusion, we propose an Integrative Multimodal Biomarker Mining framework to select interpretable, relevant, non-redundant, and generalizable multimodal biomarkers from high-dimensional, small-sample imaging and non-imaging data for diagnostic and prognostic applications. The feature selection criteria, including representativeness, robustness, discriminability, and non-redundancy, are addressed by consensus clustering, a Wilcoxon filter, sequential forward selection, and correlation analysis, respectively (a minimal sketch of this selection pipeline follows this abstract). The SHapley Additive exPlanations (SHAP) method and a nomogram are employed to further enhance feature interpretability in machine learning models.
    • To address the challenges in information-level fusion, we propose an Interpretable Deep Correlational Fusion framework, based on canonical correlation analysis (CCA), for 1) cohesive multimodal fusion of medical imaging and non-imaging data and 2) interpretation of complex non-linear cross-modal associations. Specifically, two novel loss functions are proposed to optimize the discovery of informative multimodal representations in both supervised and unsupervised deep learning, by jointly learning inter-modal consensus and intra-modal discriminative information. An interpretation module is proposed to decipher the complex non-linear cross-modal associations by leveraging interpretation methods from both deep learning and multimodal consensus learning.
    • To address the challenges in knowledge-level fusion, we propose a Dynamic Topological Analysis (DTA) framework, based on persistent homology, for knowledge distillation from inter-connected multi-focus regions in medical imaging and for the incorporation of domain knowledge. Unlike conventional feature engineering and deep learning, our DTA framework explicitly quantifies inter-region topological relationships, including global-level geometric structure and community-level clusters. A K-simplex Community Graph is proposed to construct the dynamic community graph representing community-level multi-scale graph structure. The constructed dynamic graph is subsequently tracked with a novel Decomposed Persistence algorithm. Domain knowledge is incorporated into the Adaptive Community Profile, which summarizes the tracked multi-scale community topology together with additional customizable, clinically important factors.
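
    As referenced in the feature-level fusion contribution above, a minimal sketch of that selection idea is given below: a rank-sum filter for discriminability, correlation pruning for non-redundancy, then sequential forward selection with a simple classifier. It is a generic approximation with synthetic data and an illustrative p-value threshold, not the thesis implementation.

        # Minimal sketch: rank-sum filter -> correlation-based redundancy
        # removal -> sequential forward selection. Synthetic data only.
        import numpy as np
        from scipy.stats import ranksums
        from sklearn.feature_selection import SequentialFeatureSelector
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = rng.standard_normal((120, 200))    # hypothetical multimodal features
        y = rng.integers(0, 2, size=120)       # hypothetical binary outcome

        # 1) Discriminability: keep features passing a rank-sum test (illustrative threshold).
        keep = [j for j in range(X.shape[1])
                if ranksums(X[y == 0, j], X[y == 1, j]).pvalue < 0.10]

        # 2) Non-redundancy: greedily drop features correlated > 0.9 with a kept feature.
        selected = []
        for j in keep:
            if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < 0.9 for k in selected):
                selected.append(j)

        # 3) Sequential forward selection on the remaining candidates.
        sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                        n_features_to_select=5, direction="forward")
        sfs.fit(X[:, selected], y)
        print("selected feature indices:", np.array(selected)[sfs.get_support()])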

    Context matters: the power of single-cell analyses in identifying context-dependent effects on gene expression in blood immune cells

    The human immune system is a complex system that we still do not fully understand. No two humans react in the same way to attacks by bacteria, viruses or fungi. Factors such as genetics, the type of pathogen, or previous exposure to the pathogen may explain this diversity in response. Single-cell RNA sequencing (scRNA-seq) is a new technique that enables us to study the gene expression of each cell individually, allowing us to study immune diversity in much greater detail. This increased resolution helps us discern how disease-associated genetic variants actually contribute to disease. In this thesis, I studied the relation between disease-associated genetic variants and gene expression levels in the context of different cell types and pathogen exposures, in order to gain insight into the working mechanisms of these variants. For many variants we learnt in which cell types and under which pathogen exposures they affect gene expression, and we were even able to identify changes in gene co-expression, suggesting that disease-associated variants change how our genes interact with each other. With the single-cell field being so new, much of my work focused on showing the feasibility of using scRNA-seq to study the interplay between genetics and gene expression. To set up future research, we created guidelines for these analyses and established a consortium that brings together many major scientists in the field to enable large-scale studies across an even wider variety of contexts. This final work helps inform current and future large-scale scRNA-seq research.
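
    The cell-type-dependent variant-to-expression analyses described here can be illustrated with a simple expression quantitative trait locus (eQTL) style test: regressing per-donor (pseudobulk) expression of a gene on genotype dosage separately within each cell type. The sketch below uses synthetic data and made-up cell-type names; it is a generic illustration, not the thesis pipeline.

        # Minimal sketch: per-cell-type eQTL test with ordinary least squares,
        # regressing pseudobulk expression on genotype dosage (0/1/2).
        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(0)
        n_donors = 100
        dosage = rng.integers(0, 3, size=n_donors)     # SNP dosage per donor

        for cell_type in ["monocyte", "CD4 T", "NK"]:  # hypothetical cell types
            # Synthetic pseudobulk expression of one gene per donor.
            expression = 0.2 * dosage + rng.standard_normal(n_donors)
            X = sm.add_constant(dosage.astype(float))
            fit = sm.OLS(expression, X).fit()
            print(f"{cell_type}: beta = {fit.params[1]:.2f}, p = {fit.pvalues[1]:.3g}")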