
    Statistical methods for the analysis of RNA sequencing data

    The next generation sequencing technology, RNA-sequencing (RNA-seq), has an increasing popularity over traditional microarrays in transcriptome analyses. Statistical methods used for gene expression analyses with these two technologies are different because the array-based technology measures intensities using continuous distributions, whereas RNA-seq provides absolute quantification of gene expression using counts of reads. There is a need for reliable statistical methods to exploit the information from the rapidly evolving sequencing technologies and limited work has been done on expression analysis of time-course RNA-seq data. In this dissertation, we propose a model-based clustering method for identifying gene expression patterns in time-course RNA-seq data. Our approach employs a longitudinal negative binomial mixture model to postulate the over-dispersed time-course gene count data. We also modify existing common initialization procedures to suit our model-based clustering algorithm. The effectiveness of the proposed methods is assessed using simulated data and is illustrated by real data from time-course genomic experiments. Another common issue in gene expression analysis is the presence of missing values in the datasets. Various treatments to missing values in genomic datasets have been developed but limited work has been done on RNA-seq data. In the current work, we examine the performance of various imputation methods and their impact on the clustering of time-course RNA-seq data. We develop a cluster-based imputation method which is specifically suitable for dealing with missing values in RNA-seq datasets. Simulation studies are provided to assess the performance of the proposed imputation approach

    A new graph-based clustering method with application to single-cell RNA-seq data from human pancreatic islets.

    Traditional bulk RNA-sequencing of human pancreatic islets mainly reflects transcriptional response of major cell types. Single-cell RNA sequencing technology enables transcriptional characterization of individual cells, and thus makes it possible to detect cell types and subtypes. To tackle the heterogeneity of single-cell RNA-seq data, powerful and appropriate clustering is required to facilitate the discovery of cell types. In this paper, we propose a new clustering framework based on a graph-based model with various types of dissimilarity measures. We take the compositional nature of single-cell RNA-seq data into account and employ log-ratio transformations. The practical merit of the proposed method is demonstrated through the application to the centered log-ratio-transformed single-cell RNA-seq data for human pancreatic islets. The practical merit is also demonstrated through comparisons with existing single-cell clustering methods. The R-package for the proposed method can be found at https://github.com/Zhang-Data-Science-Research-Lab/LrSClust
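As an illustration of the centered log-ratio transformation mentioned in the abstract above, here is a minimal sketch. It is not taken from the LrSClust package: the function name `clr` and the pseudocount used to handle zero counts are assumptions made for this example.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one cell's count vector.

    Each entry becomes log(count + pseudocount) minus the mean of those
    logs, so the transformed values sum to zero. The pseudocount is an
    assumed way of handling zeros, which are common in scRNA-seq counts.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

if __name__ == "__main__":
    cell = [0, 3, 10, 2]  # hypothetical counts for four genes
    print(clr(cell))
```

Subtracting the mean log is equivalent to dividing each component by the geometric mean before taking logs, which is what makes the result invariant to the cell's total count scale (exactly so when the pseudocount is zero and all counts are positive).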

    Self-supervised deep clustering of single-cell RNA-seq data to hierarchically detect rare cell populations.

    Single-cell RNA sequencing (scRNA-seq) is a widely used technique for characterizing individual cells and studying gene expression at the single-cell level. Clustering plays a vital role in grouping similar cells together for various downstream analyses. However, the high sparsity and dimensionality of large scRNA-seq data pose challenges to clustering performance. Although several deep learning-based clustering algorithms have been proposed, most existing clustering methods have limitations in capturing the precise distribution types of the data or fully utilizing the relationships between cells, leaving a considerable scope for improving the clustering performance, particularly in detecting rare cell populations from large scRNA-seq data. We introduce DeepScena, a novel single-cell hierarchical clustering tool that fully incorporates nonlinear dimension reduction, negative binomial-based convolutional autoencoder for data fitting, and a self-supervision model for cell similarity enhancement. In comprehensive evaluation using multiple large-scale scRNA-seq datasets, DeepScena consistently outperformed seven popular clustering tools in terms of accuracy. Notably, DeepScena exhibits high proficiency in identifying rare cell populations within large datasets that contain large numbers of clusters. When applied to scRNA-seq data of multiple myeloma cells, DeepScena successfully identified not only previously labeled large cell types but also subpopulations in CD14 monocytes, T cells and natural killer cells, respectively

    Model-based deep autoencoders for characterizing discrete data with application to genomic data analysis

    Deep learning techniques have achieved tremendous successes in a wide range of real applications in recent years. For dimension reduction, deep neural networks (DNNs) provide a natural choice to parameterize a non-linear transforming function that maps the original high dimensional data to a lower dimensional latent space. An autoencoder is a kind of DNN used to learn efficient feature representation in an unsupervised manner. Deep autoencoders have been widely explored and applied to analysis of continuous data, while they are understudied for characterizing discrete data. This dissertation focuses on developing model-based deep autoencoders for modeling discrete data. A motivating example of discrete data is the count data matrix generated by single-cell RNA sequencing (scRNA-seq) technology which is widely used in biological and medical fields. scRNA-seq promises to provide higher resolution of cellular differences than bulk RNA sequencing and has helped researchers to better understand complex biological questions. The recent advances in sequencing technology have enabled a dramatic increase in the throughput to thousands of cells for scRNA-seq. However, analysis of scRNA-seq data remains a statistical and computational challenge. A major problem is the pervasive dropout events obscuring the discrete matrix with prevailing 'false' zero count observations, which are caused by the shallow sequencing depth per cell. To make downstream analysis more effective, imputation, which recovers the missing values, is often conducted as the first step in preprocessing scRNA-seq data. Several imputation methods have been proposed. Of note is a deep autoencoder model, which proposes to explicitly characterize the count distribution, over-dispersion, and sparsity of scRNA-seq data using a zero-inflated negative binomial (ZINB) model. This dissertation introduces a model-based deep learning clustering model, scDeepCluster, for clustering analysis of scRNA-seq data.
The scDeepCluster is a deep autoencoder which simultaneously learns feature representation and clustering via explicit modeling of scRNA-seq data generation using the ZINB model. Based on testing extensive simulated datasets and real datasets from different representative single-cell sequencing platforms, scDeepCluster outperformed several state-of-the-art methods under various clustering performance metrics and exhibited improved scalability, with running time increasing linearly with the sample size. Although this model-based deep autoencoder approach has demonstrated superior performance, it is over-permissive in defining ZINB model space, which can lead to an unidentifiable model and make results unstable. Next, this dissertation proposes to impose a regularization that takes dropout events into account. The regularization uses a differentiable categorical distribution - Gumbel-Softmax to explicitly model the dropout events, and minimizes the Maximum Mean Discrepancy (MMD) between the reconstructed randomly masked matrix and the raw count matrix. Imputation analyses showed that the proposed regularized model-based autoencoder significantly outperformed the vanilla model-based deep autoencoder
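To make the zero-inflated negative binomial (ZINB) model named in this abstract concrete, the sketch below evaluates its negative log-likelihood for a single count. It is an illustrative assumption, not the scDeepCluster implementation: the mean/dispersion parameterization (`mu`, `theta`), the dropout probability `pi`, and both function names are chosen for this example only.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial distribution
    with mean mu > 0 and dispersion theta > 0 (variance mu + mu**2 / theta)."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a ZINB model.

    pi (0 <= pi < 1) is the probability of an extra 'dropout' zero;
    otherwise the count is drawn from the negative binomial above.
    """
    log_nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from dropout or from the NB component itself.
        return -math.log(pi + (1.0 - pi) * math.exp(log_nb))
    # A positive count can only come from the NB component.
    return -(math.log(1.0 - pi) + log_nb)

if __name__ == "__main__":
    for count in (0, 1, 5):
        print(count, zinb_nll(count, mu=2.0, theta=1.5, pi=0.3))
```

In an autoencoder setting, a loss of this kind would typically be summed over all cells and genes, with `mu`, `theta`, and `pi` produced by the decoder; that wiring is omitted here.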

    Model-based deep autoencoders for clustering single-cell RNA sequencing data with side information

    Clustering analysis has been conducted extensively in single-cell RNA sequencing (scRNA-seq) studies. scRNA-seq can profile the activities of tens of thousands of genes within a single cell. Thousands or tens of thousands of cells can be captured simultaneously in a typical scRNA-seq experiment. Biologists would like to cluster these cells for exploring and elucidating cell types or subtypes. Numerous methods have been designed for clustering scRNA-seq data. Yet single-cell technologies have developed so fast in the past few years that existing methods have not caught up with these rapid changes and fail to fully realize their potential. For instance, besides profiling transcription expression levels of genes, recent single-cell technologies can capture other auxiliary information at the single-cell level, such as protein expression (multi-omics scRNA-seq) and cells' spatial location information (spatial-resolved scRNA-seq). Most existing clustering methods for scRNA-seq are performed in an unsupervised manner and fail to exploit available side information for optimizing clustering performance. This dissertation focuses on developing novel computational methods for clustering scRNA-seq data. The basic models are built on a deep autoencoder (AE) framework, which is coupled with a ZINB (zero-inflated negative binomial) loss to characterize the zero-inflated and over-dispersed scRNA-seq count data. To integrate multi-omics scRNA-seq data, a multimodal autoencoder (MAE) is employed. It applies one encoder for the multimodal inputs and two decoders for reconstructing each omics of data. This model is named scMDC (Single-Cell Multi-omics Deep Clustering). Besides, it is expected that cells in spatial proximity tend to be of the same cell types. To exploit cellular spatial information available for spatial-resolved scRNA-seq (sp-scRNA-seq) data, a novel model, DSSC (Deep Spatial-constrained Single-cell Clustering), is developed.
DSSC integrates the spatial information of cells into the clustering process in two steps: 1) the spatial information is encoded by using a graphical neural network model; 2) cell-to-cell constraints are built based on the spatial expression pattern of the marker genes and added to the model to guide the clustering process. DSSC is the first model which can utilize the information from both the spatial coordinates and the marker genes to guide the cell/spot clustering. For both scMDC and DSSC, a clustering loss is optimized on the bottleneck layer of the autoencoder along with the learning of feature representation. Extensive experiments on both simulated and real datasets demonstrate that scMDC and DSSC boost clustering performance significantly while costing no extra time and space during the training process. These models hold great promise as valuable tools for harnessing the full potential of state-of-the-art single-cell data.

    Novel statistical methods in analyzing single cell sequencing data

    Understanding biological systems requires the knowledge of their individual components. Single cell RNA sequencing (scRNA-Seq) has become a revolutionary tool to investigate cell-to-cell transcriptomic heterogeneity, which cannot be obtained in population-averaged measurements such as the bulk RNA-Seq. This dissertation focuses on developing novel statistical methods for analyzing droplet-based single cell data, which includes clustering methods to identify cell types from single or multiple individuals, and a joint clustering approach to analyze paired data from Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-Seq), a new state-of-the-art technology that allows the detection of cell surface proteins and transcriptome profiling within the same cell simultaneously. In the first part of this dissertation, I developed DIMM-SC, a Dirichlet mixture model which explicitly models the raw UMI count for clustering droplet-based scRNA-Seq data and produces cluster membership with uncertainties. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other clustering methods. In the second part, I developed BAMM-SC, a novel Bayesian hierarchical Dirichlet mixture model to cluster droplet-based scRNA-Seq data from population studies. BAMM-SC takes raw count data as input and accounts for data heterogeneity and batch effect among multiple individuals in a unified Bayesian hierarchical model framework. Extensive simulation studies and applications to multiple in-house scRNA-Seq datasets demonstrated that BAMM-SC outperformed existing clustering methods with improved clustering accuracy. In the third part, I developed BREM-SC, a novel random effects model that jointly clusters the paired data from CITE-Seq.
Simulations and analysis of in-house real data sets were performed, which successfully demonstrated the validity and advantages of our method in understanding the heterogeneity and dynamics of various cell populations. Contribution to public health: Recent droplet-based single cell sequencing technology and its extensions have brought revolutionary insights to the understanding of cell heterogeneity and molecular processes at single cell resolution. I believe the proposed statistical approaches in this dissertation for single cell data will help us fully understand cell identity and function. This will promote the innovation for the traditional public health and medical research

    Novel Bayesian methodology for the analysis of single-cell RNA sequencing data.

    With single-cell RNA sequencing (scRNA-seq) technology, researchers are able to gain a better understanding of health and disease through the analysis of gene expression data at the cellular-level; however, scRNA-seq data tend to have high proportions of zero values, increased cell-to-cell variability, and overdispersion due to abnormally large expression counts, which create new statistical problems that need to be addressed. This dissertation includes three research projects that propose Bayesian methodology suitable for scRNA-seq analysis. In the first project, a hurdle model for identifying differentially expressed genes across cell types in scRNA-seq data is presented. This model incorporates a correlated random effects structure based on an initial clustering of cells to capture the cell-to-cell variability within treatment groups but can easily be adapted to an independent random effect structure if needed. A sparse Bayesian factor model is introduced in the second project to uncover network structures associated with genes in scRNA-seq data. Latent factors impact the gene expression values for each cell and provide flexibility to account for the common features of scRNA-seq. The third project expands upon this latent factor model to allow for the comparison of networks across different treatment groups

    Single-cell RNA-seq data analysis using graph autoencoders and graph attention networks

    With the development of high-throughput sequencing technology, the scale of single-cell RNA sequencing (scRNA-seq) data has surged. Its data are typically high-dimensional, with high dropout noise and high sparsity. Therefore, gene imputation and cell clustering analysis of scRNA-seq data is increasingly important. Statistical or traditional machine learning methods are inefficient, and improved accuracy is needed. The methods based on deep learning cannot directly process non-Euclidean spatial data, such as cell diagrams. In this study, we developed scGAEGAT, a multi-modal model with graph autoencoders and graph attention networks for scRNA-seq analysis based on graph neural networks. Cosine similarity, median L1 distance, and root-mean-squared error were used to measure the gene imputation performance of different methods for comparison with scGAEGAT. Furthermore, adjusted mutual information, normalized mutual information, completeness score, and Silhouette coefficient score were used to measure the cell clustering performance of different methods for comparison with scGAEGAT. Experimental results demonstrated promising performance of the scGAEGAT model in gene imputation and cell clustering prediction on four scRNA-seq data sets with gold-standard cell labels
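Normalized mutual information, one of the clustering metrics listed in this abstract, can be computed from two label vectors as sketched below. This is a generic illustration rather than the evaluation code used for scGAEGAT: the function names are hypothetical, and normalizing by the arithmetic mean of the two entropies is an assumed convention (other normalizations exist).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two label assignments."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())

def nmi(a, b):
    """Mutual information divided by the mean of the two entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both assignments put every cell in a single cluster
    return mutual_information(a, b) / ((h_a + h_b) / 2.0)

if __name__ == "__main__":
    true_labels = [0, 0, 1, 1, 2, 2]  # hypothetical gold-standard cell types
    predicted = [1, 1, 0, 0, 2, 2]    # same partition, different label names
    print(nmi(true_labels, predicted))
```

Because the score depends only on how cells are grouped, relabeling the clusters does not change it, which is why it suits comparisons against gold-standard cell labels.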