
    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
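    As a minimal sketch of the integration setting this review describes (all data and names here are hypothetical, not from the paper), two omics blocks measured on the same samples but on very different scales can be standardized per block, concatenated, and reduced with PCA to mitigate both data heterogeneity and the curse of dimensionality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: two omics blocks over the same 50 samples,
# on very different scales (e.g. expression counts vs. methylation betas).
expr = rng.normal(loc=100.0, scale=20.0, size=(50, 200))   # "transcriptome"
meth = rng.uniform(0.0, 1.0, size=(50, 30))                # "epigenome"

def zscore(block):
    """Standardize each feature so no block dominates by scale alone."""
    return (block - block.mean(axis=0)) / block.std(axis=0)

# Early (concatenation-based) integration of the standardized blocks.
joint = np.hstack([zscore(expr), zscore(meth)])            # 50 x 230

# Reduce dimensionality with PCA via SVD to mitigate the p >> n problem.
joint_centered = joint - joint.mean(axis=0)
U, S, Vt = np.linalg.svd(joint_centered, full_matrices=False)
embedding = U[:, :10] * S[:10]                             # 50 x 10 scores

print(embedding.shape)
```

    This is the simplest of several integration strategies; the review also covers approaches that handle missing blocks, class imbalance, and scalability, which a single concatenation step does not address.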

    K-Nearest-Neighbors Induced Topological PCA for scRNA Sequence Data Analysis

    Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture the geometrical structure embedded in the data, and previous graph Laplacian regularizations are limited to analysis at a single scale. We propose a topological Principal Components Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with L2,1 norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses many limitations of traditional persistent homology. Rather than inducing the filtration by varying a distance threshold, we introduce kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and show that our methods outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF), by significant margins. Comment: 28 pages, 11 figures
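    The kNN-induced filtration idea can be illustrated with a minimal numpy sketch (hypothetical toy data, not the authors' implementation): instead of growing a distance threshold, we grow k, and at each scale the number of (near-)zero eigenvalues of the graph Laplacian counts the connected components, giving a multiscale view of the data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical "cell clusters" in 2-D for illustration.
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# Pairwise Euclidean distances.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def knn_laplacian(D, k):
    """Symmetrized kNN adjacency and its combinatorial graph Laplacian."""
    n = D.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]   # skip self at index 0
        A[i, nbrs] = 1.0
    A = np.maximum(A, A.T)                  # symmetric kNN graph
    L = np.diag(A.sum(axis=1)) - A
    return L

# Filtration by increasing k: the number of (near-)zero Laplacian
# eigenvalues equals the number of connected components at that scale.
components = []
for k in (2, 5, 15, 39):
    eigvals = np.linalg.eigvalsh(knn_laplacian(D, k))
    components.append(int(np.sum(eigvals < 1e-8)))

print(components)
```

    Since enlarging k only adds edges, the component count is non-increasing along the filtration; the two planted clusters persist across a wide range of k before merging when the graph becomes complete.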

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, and heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey of deep learning in single-cell analysis. We first introduce the background of single-cell technologies and their development, as well as fundamental concepts of deep learning, including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications, noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline: multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis
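    One of the tasks the survey reviews, cell-type annotation, has a classical baseline that deep methods are typically compared against: nearest-centroid assignment to annotated reference profiles. A minimal sketch with entirely synthetic data (cell-type names and dimensions are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical reference: 3 annotated cell types, one 5-gene profile each.
centroids = rng.normal(size=(3, 5))
cell_types = ["T cell", "B cell", "Monocyte"]

# Query cells: noisy copies of the reference profiles.
labels_true = rng.integers(0, 3, size=30)
queries = centroids[labels_true] + rng.normal(scale=0.1, size=(30, 5))

# Nearest-centroid annotation: assign each query cell to the closest
# reference profile in Euclidean distance.
dists = np.linalg.norm(queries[:, None, :] - centroids[None, :, :], axis=-1)
labels_pred = dists.argmin(axis=1)

accuracy = float((labels_pred == labels_true).mean())
print(accuracy)
```

    Deep annotation methods aim to beat this baseline precisely where it fails: batch effects, dropout noise, and cell types absent from the reference.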

    STGIC: a graph and image convolution-based method for spatial transcriptomic clustering

    Spatial transcriptomic (ST) clustering uses spatial and transcriptional information to group spots that are spatially coherent and transcriptionally similar into the same spatial domain. Graph convolutional networks (GCN) and graph attention networks (GAT), fed with an adjacency matrix derived from spatial coordinates and a feature matrix derived from transcription profiles, are often used to solve the problem. Our proposed method STGIC (spatial transcriptomic clustering with graph and image convolution) utilizes adaptive graph convolution (AGC) to obtain high-quality pseudo-labels and then applies a dilated convolution framework (DCF) to a virtual image constructed from the gene expression information and spatial coordinates of spots. The dilation rates and kernel sizes are set appropriately, and the updating of kernel weights is made subject to the spatial distance from the position of the corresponding element to the kernel center, so that feature extraction for each spot is better guided by its spatial distance to neighboring spots. Self-supervision realized by KL divergence, a spatial continuity loss, and cross-entropy computed among spots with high-confidence pseudo-labels makes up the training objective of the DCF. STGIC attains state-of-the-art (SOTA) clustering performance on the benchmark dataset of human dorsolateral prefrontal cortex (DLPFC). Besides, it is capable of depicting fine structures of tissues from other species and of guiding the identification of marker genes. STGIC also extends to Stereo-seq data with high spatial resolution. Comment: Major revision has been made to generate the current version as follows: 1. Writing style has been thoroughly changed. 2. Four more datasets have been added. 3. Contrastive learning has been removed since it does not make a significant difference to the performance. 4. Two more authors are added
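    The distance-guided weighting idea can be sketched as a kernel whose weights decay with spatial distance to the center, so nearer spots contribute more to the extracted feature. This is an illustrative Gaussian-decay construction, not the exact kernel or update rule used by STGIC:

```python
import numpy as np

def distance_weighted_kernel(size, sigma=1.0):
    """Kernel whose weights decay with spatial distance to the center
    (a sketch of distance-guided weighting; sigma is illustrative)."""
    c = (size - 1) / 2.0
    ys, xs = np.mgrid[0:size, 0:size]
    dist2 = (ys - c) ** 2 + (xs - c) ** 2
    w = np.exp(-dist2 / (2.0 * sigma ** 2))
    return w / w.sum()   # normalize so the weights sum to 1

k = distance_weighted_kernel(5, sigma=1.5)
print(k[2, 2], k[0, 0])  # center weight is the largest, corners smallest
```

    In STGIC the kernel weights are learned, with the distance constraint applied to their updates; combining such kernels with different dilation rates enlarges the receptive field over the virtual image without extra parameters.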

    GSAE: an autoencoder with embedded gene-set nodes for genomics functional characterization

    Bioinformatics tools have been developed to interpret gene expression data at the gene-set level, and these gene-set-based analyses improve biologists' ability to discover the functional relevance of their experimental design. However, while gene sets are elucidated individually, associations between gene sets are rarely taken into consideration. Deep learning, an emerging machine learning technique in computational biology, can be used to generate an unbiased combination of gene sets and to determine the biological relevance and analytical consistency of these combined gene sets by leveraging large genomic datasets. In this study, we propose a gene superset autoencoder (GSAE), a multi-layer autoencoder model incorporating a priori defined gene sets, which retains the crucial biological features in the latent layer. We introduce the concept of the gene superset, an unbiased combination of gene sets with weights trained by the autoencoder, where each node in the latent layer is a superset. Trained with genomic data from TCGA and evaluated with the accompanying clinical parameters, we show that gene supersets can discriminate tumor subtypes and carry prognostic capability. We further demonstrate the biological relevance of the top component gene sets in the significant supersets. Using the autoencoder model with gene supersets at its latent layer, we demonstrate that gene supersets retain sufficient biological information with respect to tumor subtypes and clinical prognostic significance. Supersets also provide high reproducibility in survival analysis and accurate prediction of cancer subtypes. Comment: Presented at the International Conference on Intelligent Biology and Medicine (ICIBM 2018) in Los Angeles, CA, USA and published in BMC Systems Biology 2018, 12(Suppl 8):14
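    The core trick of embedding a-priori gene sets into a network layer is a connectivity mask: a weight from gene g to node s is allowed only if g belongs to gene set s. A minimal numpy sketch of that masking idea (toy gene sets and sizes, not GSAE's actual architecture or training):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy setup: 6 genes, 2 a-priori gene sets.
gene_sets = {"set_A": [0, 1, 2], "set_B": [3, 4, 5]}
n_genes, n_sets = 6, len(gene_sets)

# Connectivity mask: weight (g, s) is allowed only if gene g belongs to
# gene set s, so each node of the layer represents one gene set.
mask = np.zeros((n_genes, n_sets))
for s, (name, genes) in enumerate(gene_sets.items()):
    mask[genes, s] = 1.0

W = rng.normal(size=(n_genes, n_sets)) * mask   # masked weights

x = rng.normal(size=(1, n_genes))               # one expression profile
gene_set_activity = np.tanh(x @ W)              # one score per gene set

print(gene_set_activity.shape)
```

    In GSAE a subsequent dense latent layer then combines these gene-set nodes into supersets, with the combination weights learned end to end by the autoencoder's reconstruction objective.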

    Novel Techniques for Single-cell RNA Sequencing Data Imputation and Clustering

    Advances in single-cell technologies have shifted genomics research from the analysis of bulk tissues toward a comprehensive characterization of individual cells. These cutting-edge approaches enable the in-depth analysis of individual cells, unveiling the remarkable heterogeneity and complexity of cellular systems. By unraveling the unique signatures and functions of distinct cell types, single-cell technologies have not only deepened our understanding of fundamental biological processes but also unlocked new avenues for disease diagnostics and therapeutic interventions.

    The applications of single-cell technologies extend beyond basic research, with significant implications for precision medicine, drug discovery, and regenerative medicine. By capturing the cellular heterogeneity within tumors, these methods have shed light on the mechanisms of tumor evolution, metastasis, and therapy resistance. Additionally, they have facilitated the identification of rare cell populations with specialized functions, such as stem cells and tissue-resident immune cells, which hold great promise for cell-based therapies.

    However, one of the major challenges in analyzing scRNA-seq data is the prevalence of dropouts: instances where a gene's expression is not detected even though the gene is expressed in the cell. Dropouts arise from technical limitations and introduce excessive noise into the data, obscuring the true biological signals. As a result, imputation methods are used to estimate the missing values and reduce the impact of dropouts on downstream analyses. Furthermore, the high dimensionality of scRNA-seq data presents additional challenges for effectively partitioning cell populations. Thus, robust computational approaches are required to overcome these challenges and extract meaningful biological insights from single-cell data.

    Numerous imputation and clustering methods have been developed specifically to address the unique challenges of scRNA-seq data analysis. These methods aim to reduce the impact of dropouts and high dimensionality, allowing accurate partitioning of cell populations and the discovery of meaningful biological insights. While these methods have unquestionably advanced the field of single-cell transcriptomics, they are not without limitations. Some are computationally intensive, leading to scalability issues with large datasets, whereas others may introduce biases or overfit the data, potentially affecting the accuracy of subsequent analyses. Furthermore, their performance can vary with the complexity and heterogeneity of the dataset. Ongoing research is therefore required to improve existing methodologies and to create new algorithms that address these limitations while retaining robustness and accuracy in scRNA-seq data analysis.

    In this work, we propose three imputation approaches that combine statistical and deep learning frameworks. We robustly reconstruct the gene expression matrix, effectively mitigating dropout effects and reducing noise, which enhances the recovery of true biological signals from scRNA-seq data and leverages the transcriptomic profiles of single cells. In addition, we introduce a clustering method that exploits scRNA-seq data to identify cellular subpopulations. Our method employs a combination of dimensionality reduction and network fusion algorithms to generate a cell similarity graph. This approach accounts for both local and global structure within the data, enabling the discovery of rare and previously unidentified cell populations.

    We assess the imputation and clustering methods through rigorous benchmarking on simulated data and more than 30 real scRNA-seq datasets against existing state-of-the-art techniques. We show that the imputed data generated by our methods enhance the quality of downstream analyses. We also demonstrate that our clustering algorithm accurately and efficiently identifies cell populations and is capable of analyzing large datasets.

    In conclusion, this thesis proposes alternative approaches to advance the current state of scRNA-seq data analysis by developing innovative imputation and clustering methods that enable a more comprehensive and accurate characterization of cellular subpopulations. These advancements have potentially broad applicability in diverse research fields, including developmental biology, immunology, and oncology, where understanding cellular heterogeneity is crucial
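    A simple baseline for the dropout-imputation problem described above is kNN smoothing: replace each zero with the mean of the same gene across the most similar cells. This toy numpy sketch (synthetic data; not the statistical or deep learning methods proposed in the thesis) shows why such smoothing reduces error on dropped-out entries:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical toy expression matrix (cells x genes) with dropouts (zeros).
true_expr = rng.gamma(shape=2.0, scale=3.0, size=(100, 20))
dropout_mask = rng.random(true_expr.shape) < 0.2
observed = np.where(dropout_mask, 0.0, true_expr)

# Cell-cell distances computed from the observed profiles.
D = np.linalg.norm(observed[:, None, :] - observed[None, :, :], axis=-1)

# kNN-smoothing baseline: replace each zero with the mean of that gene
# over the k most similar cells.
k = 10
imputed = observed.copy()
for i in range(observed.shape[0]):
    nbrs = np.argsort(D[i])[1:k + 1]        # skip the cell itself
    zeros = observed[i] == 0.0
    imputed[i, zeros] = observed[nbrs][:, zeros].mean(axis=0)

# Imputation reduces error on the dropped-out entries.
err_before = np.abs(observed - true_expr)[dropout_mask].mean()
err_after = np.abs(imputed - true_expr)[dropout_mask].mean()
print(err_before, err_after)
```

    The baseline's weaknesses (neighbor dropouts bias the mean downward; distances are themselves corrupted by dropouts) are exactly the failure modes that motivate the more sophisticated imputation approaches proposed here.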

    Unsupervised Machine Learning Algorithms to Characterize Single-Cell Heterogeneity and Perturbation Response

    Recent advances in microfluidic technologies facilitate the measurement of gene expression, DNA accessibility, protein content, or genomic mutations at unprecedented scale. The challenges imposed by the scale of these datasets are further exacerbated by non-linearity in molecular effects, complex interdependencies between features, and a lack of understanding of both the data-generating processes and the sources of technical and biological noise. As a result, analysis of modern single-cell data requires the development of specialized computational tools. One solution to these problems is manifold learning, a sub-field of unsupervised machine learning that seeks to model data geometry under the simplifying assumption that the underlying system is continuous and locally Euclidean. In this dissertation, I show how manifold learning is naturally suited for single-cell analysis and introduce three related algorithms for characterizing single-cell heterogeneity and perturbation response. I first describe Vertex Frequency Clustering, an algorithm that identifies groups of cells with similar responses to an experimental perturbation by analyzing the spectral representation of condition labels expressed as signals over a cell similarity graph. Next, I introduce MELD, an algorithm that expands on these ideas to estimate the density of each experimental sample over the graph, quantifying the effect of an experimental perturbation at single-cell resolution. Finally, I describe a neural network for archetypal analysis that represents the data as continuously distributed between a set of extrema. Each of these algorithms is demonstrated on a combination of real and synthetic datasets and benchmarked against state-of-the-art algorithms
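    The graph-signal idea behind the density estimation can be sketched in a few lines of numpy: treat the sample label as a signal on the cell similarity graph and low-pass filter it with a heat kernel exp(-tL). This is an illustration of the general concept on a toy ring graph, not the MELD implementation:

```python
import numpy as np

# Hypothetical cell similarity graph: 30 cells on a ring (a toy manifold).
n = 30
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A                  # graph Laplacian

# Indicator of which cells came from the "treatment" sample.
indicator = np.zeros(n)
indicator[:10] = 1.0

# Low-pass filter the sample label over the graph with the heat kernel
# exp(-t L), yielding a smooth per-cell density of the condition.
t = 2.0
eigvals, eigvecs = np.linalg.eigh(L)
heat = eigvecs @ np.diag(np.exp(-t * eigvals)) @ eigvecs.T
density = heat @ indicator

print(density.round(2))
```

    Because the heat kernel's rows sum to one, total mass is conserved; cells near the treated region receive high density while distant cells receive low density, which is the per-cell perturbation signal the dissertation builds on.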