128 research outputs found

    Matching single cells across modalities with contrastive learning and optimal transport.

    Get PDF
    Understanding the interactions between the biomolecules that govern cellular behaviors remains an emergent question in biology. Recent advances in single-cell technologies have enabled the simultaneous quantification of multiple biomolecules in the same cell, opening new avenues for understanding cellular complexity and heterogeneity. Still, the resulting multimodal single-cell datasets present unique challenges arising from the high dimensionality and multiple sources of acquisition noise. Computational methods able to match cells across different modalities offer an appealing alternative towards this goal. In this work, we propose MatchCLOT, a novel method for modality matching inspired by recent promising developments in contrastive learning and optimal transport. MatchCLOT uses contrastive learning to learn a common representation between two modalities and applies entropic optimal transport as an approximate maximum weight bipartite matching algorithm. Our model obtains state-of-the-art performance on two curated benchmarking datasets and an independent test dataset, improving the top scoring method by 26.1% while preserving the underlying biological structure of the multimodal data. Importantly, MatchCLOT offers high gains in computational time and memory that, in contrast to existing methods, allows it to scale well with the number of cells. As single-cell datasets become increasingly large, MatchCLOT offers an accurate and efficient solution to the problem of modality matching

    Identifying Cancer Subtypes Using Unsupervised Deep Learning

    Get PDF
    Glioblastoma multiforme (GBM) is the most fatal malignant type of brain tumor with a very poor prognosis with a median survival of around one year. Numerous studies have reported tumor subtypes that consider different characteristics on individual patients, which may play important roles in determining the survival rates in GBM. In this study, we present a pathway-based clustering method using Restricted Boltzmann Machine (RBM), called R-PathCluster, for identifying unknown subtypes with pathway markers of gene expressions. In order to assess the performance of R-PathCluster, we conducted experiments with several clustering methods such as k-means, hierarchical clustering, and RBM models with different input data. R-PathCluster showed the best performance in clustering longterm and short-term survivals, although its clustering score was not the highest among them in experiments. R-PathCluster provides a solution to interpret the model in biological sense, since it takes pathway markers that represent biological process of pathways. We discussed that our findings from R-PathCluster are supported by many biological literatures. Keywords. Glioblastoma multiforme, tumor subtypes, clustering, Restricted Boltzmann Machin

    Biologically Interpretable, Integrative Deep Learning for Cancer Survival Analysis

    Get PDF
    Identifying complex biological processes associated to patients\u27 survival time at the cellular and molecular level is critical not only for developing new treatments for patients but also for accurate survival prediction. However, highly nonlinear and high-dimension, low-sample size (HDLSS) data cause computational challenges in survival analysis. We developed a novel family of pathway-based, sparse deep neural networks (PASNet) for cancer survival analysis. PASNet family is a biologically interpretable neural network model where nodes in the network correspond to specific genes and pathways, while capturing nonlinear and hierarchical effects of biological pathways associated with certain clinical outcomes. Furthermore, integration of heterogeneous types of biological data from biospecimen holds promise of improving survival prediction and personalized therapies in cancer. Specifically, the integration of genomic data and histopathological images enhances survival predictions and personalized treatments in cancer study, while providing an in-depth understanding of genetic mechanisms and phenotypic patterns of cancer. Two proposed models will be introduced for integrating multi-omics data and pathological images, respectively. Each model in PASNet family was evaluated by comparing the performance of current cutting-edge models with The Cancer Genome Atlas (TCGA) cancer data. In the extensive experiments, PASNet family outperformed the benchmarking methods, and the outstanding performance was statistically assessed. More importantly, PASNet family showed the capability to interpret a multi-layered biological system. A number of biological literature in GBM supported the biological interpretation of the proposed models. The open-source software of PASNet family in PyTorch is publicly available at https://github.com/DataX-JieHao

    Fuse: Multiple Network Alignment via Data Fusion

    Get PDF

    Computational Labeling, Partitioning, and Balancing of Molecular Networks

    Get PDF
    Recent advances in high throughput techniques enable large-scale molecular quantification with high accuracy, including mRNAs, proteins and metabolites. Differential expression of these molecules in case and control samples provides a way to select phenotype-associated molecules with statistically significant changes. However, given the significance ranking list of molecular changes, how those molecules work together to drive phenotype formation is still unclear. In particular, the changes in molecular quantities are insufficient to interpret the changes in their functional behavior. My study is aimed at answering this question by integrating molecular network data to systematically model and estimate the changes of molecular functional behaviors. We build three computational models to label, partition, and balance molecular networks using modern machine learning techniques. (1) Due to the incompleteness of protein functional annotation, we develop AptRank, an adaptive PageRank model for protein function prediction on bilayer networks. By integrating Gene Ontology (GO) hierarchy with protein-protein interaction network, our AptRank outperforms four state-of-the-art methods in a comprehensive evaluation using benchmark datasets. (2) We next extend our AptRank into a network partitioning method, BioSweeper, to identify functional network modules in which molecules share similar functions and also densely connect to each other. Compared to traditional network partitioning methods using only network connections, BioSweeper, which integrates the GO hierarchy, can automatically identify functionally enriched network modules. (3) Finally, we conduct a differential interaction analysis, namely difFBA, on protein-protein interaction networks by simulating protein fluxes using flux balance analysis (FBA). We test difFBA using quantitative proteomic data from colon cancer, and demonstrate that difFBA offers more insights into functional changes in molecular behavior than does protein quantity changes alone. We conclude that our integrative network model increases the observational dimensions of complex biological systems, and enables us to more deeply understand the causal relationships between genotypes and phenotypes

    Community Detection in Multimodal Networks

    Get PDF
    Community detection on networks is a basic, yet powerful and ever-expanding set of methodologies that is useful in a variety of settings. This dissertation discusses a range of different community detection on networks with multiple and non-standard modalities. A major focus of analysis is on the study of networks spanning several layers, which represent relationships such as interactions over time, different facets of high-dimensional data. These networks may be represented by several different ways; namely the few-layer (i.e. longitudinal) case as well as the many-layer (time-series cases). In the first case, we develop a novel application of variational expectation maximization as an example of the top-down mode of simultaneous community detection and parameter estimation. In the second case, we use a bottom-up strategy of iterative nodal discovery for these longer time-series, abetted with the assumption of their structural properties. In addition, we explore significantly self-looping networks, whose features are inseparable from the inherent construction of spatial networks whose weights are reflective of distance information. These types of networks are used to model and demarcate geographical regions. We also describe some theoretical properties and applications of a method for finding communities in bipartite networks that are weighted by correlations between samples. We discuss different strategies for community detection in each of these different types of networks, as well as their implications for the broader contributions to the literature. In addition to the methodologies, we also highlight the types of data wherein these ``non-standard" network structures arise and how they are fitting for the applications of the proposed methodologies: particularly spatial networks and multilayer networks. We apply the top-down and bottom-up community detection algorithms to data in the domains of demography, human mobility, genomics, climate science, psychiatry, politics, and neuroimaging. The expansiveness and diversity of these data speak to the flexibility and ubiquity of our proposed methods to all forms of relational data.Doctor of Philosoph
    corecore