712 research outputs found

    reComBat: batch-effect removal in large-scale multi-source gene-expression data integration

    With the steadily increasing abundance of omics data produced all over the world under vastly different experimental conditions and residing in public databases, a crucial step in many data-driven bioinformatics applications is data integration. The challenge of batch-effect removal for entire databases lies in the large number of batches and biological variation, which can result in design matrix singularity. This problem cannot currently be solved satisfactorily by any common batch-correction algorithm. We present reComBat, a regularized version of the empirical Bayes method, to overcome this limitation and benchmark it against popular approaches for the harmonization of public gene-expression data (both microarray and bulk RNA-seq) of the human opportunistic pathogen Pseudomonas aeruginosa. Batch effects are successfully mitigated while biologically meaningful gene-expression variation is retained. reComBat fills the gap in batch-correction approaches applicable to large-scale, public omics databases and opens up new avenues for data-driven analysis of complex biological processes beyond the scope of a single study. The code is available at https://github.com/BorgwardtLab/reComBat; all data and evaluation code can be found at https://github.com/BorgwardtLab/batchCorrectionPublicData. Supplementary data are available at Bioinformatics Advances online.
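    The design matrix singularity mentioned in this abstract can be illustrated with a toy sketch. It is not the reComBat implementation: the data are hypothetical, the helper names are invented for illustration, and a plain ridge penalty stands in for whatever regularization the package actually applies. When a batch indicator and a biological-condition indicator coincide exactly, the Gram matrix has zero determinant, and adding a small penalty to its diagonal makes the system solvable again.

```python
def gram(X):
    """Return X^T X for a design matrix given as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]


def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def ridge_fit_2col(X, y, lam):
    """Solve (X^T X + lam * I) b = X^T y for a two-column design matrix."""
    G = gram(X)
    G[0][0] += lam
    G[1][1] += lam
    d = det2(G)
    if d == 0:
        raise ValueError("singular system; use lam > 0")
    r = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(2)]
    # Closed-form inverse of the 2x2 system.
    return [(G[1][1] * r[0] - G[0][1] * r[1]) / d,
            (G[0][0] * r[1] - G[1][0] * r[0]) / d]


# Hypothetical toy data: column 0 is a batch indicator, column 1 a
# condition indicator. Every sample in the batch has the condition,
# so the two columns are identical (perfect confounding).
X = [[1, 1], [1, 1], [0, 0], [0, 0]]
y = [3.0, 3.2, 1.0, 1.2]

singular_det = det2(gram(X))          # 0: ordinary least squares has no unique solution
coef = ridge_fit_2col(X, y, lam=0.1)  # the penalty restores a unique solution
```

    With the penalty, the shared effect is split evenly between the two confounded columns. This keeps the fit well defined but does not by itself separate batch from biology.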

    DBnorm as an R package for the comparison and selection of appropriate statistical methods for batch effect correction in metabolomic studies.

    As a powerful phenotyping technology, metabolomics provides new opportunities in biomarker discovery through metabolome-wide association studies (MWAS) and the identification of metabolites having a regulatory effect in various biological processes. While mass spectrometry-based (MS) metabolomics assays are endowed with high throughput and sensitivity, MWAS require long-term data acquisition, which generates an analytical signal drift over time that can hinder the uncovering of real, biologically relevant changes. We developed "dbnorm", a package in the R environment, which allows for an easy comparison of the model performance of advanced statistical tools commonly used in metabolomics to remove batch effects from large metabolomics datasets. "dbnorm" integrates advanced statistical tools to inspect the dataset structure not only at the macroscopic (sample batches) scale, but also at the microscopic (metabolic features) level. To compare model performance on data correction, "dbnorm" assigns a score that helps users identify the best-fitting model for each dataset. In this study, we applied "dbnorm" to two large-scale metabolomics datasets as a proof of concept. We demonstrate that "dbnorm" allows for the accurate selection of the most appropriate statistical tool to efficiently remove the signal drift over time and to focus on the relevant biological components of complex datasets.

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis

    Data harmonisation for information fusion in digital healthcare: A state-of-the-art systematic review, meta-analysis and future research directions

    Removing the bias and variance of multicentre data has always been a challenge in large scale digital healthcare studies, which requires the ability to integrate clinical features extracted from data acquired by different scanners and protocols to improve stability and robustness. Previous studies have described various computational approaches to fuse single modality multicentre datasets. However, these surveys rarely focused on evaluation metrics and lacked a checklist for computational data harmonisation studies. In this systematic review, we summarise the computational data harmonisation approaches for multi-modality data in the digital healthcare field, including harmonisation strategies and evaluation metrics based on different theories. In addition, a comprehensive checklist that summarises common practices for data harmonisation studies is proposed to guide researchers to report their research findings more effectively. Last but not least, flowcharts presenting possible ways for methodology and metric selection are proposed and the limitations of different methods have been surveyed for future research

    Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics

    The spatial organization of cell types in tissues fundamentally shapes cellular interactions and function, but the high-throughput spatial mapping of complex tissues remains a challenge. We present cell2location, a principled and versatile Bayesian model that integrates single-cell and spatial transcriptomics to map cell types in situ in a comprehensive manner. We show that cell2location outperforms existing tools in accuracy and comprehensiveness and we demonstrate its utility by mapping two complex tissues. In the mouse brain, we use a new paired single nucleus and spatial RNA-sequencing dataset to map dozens of cell types and identify tissue regions in an automated manner. We discover novel regional astrocyte subtypes including fine subpopulations in the thalamus and hypothalamus. In the human lymph node, we resolve spatially interlaced immune cell states and identify co-located groups of cells underlying tissue organisation. We spatially map a rare pre-germinal centre B-cell population and predict putative cellular interactions relevant to the interferon response. Collectively our results demonstrate how cell2location can serve as a versatile first-line analysis tool to map tissue architectures in a high-throughput manner. Competing Interest Statement: The authors have declared no competing interest.

    Generative Models of Biological Variations in Bulk and Single-cell RNA-seq

    The explosive growth of next-generation sequencing data enhances our ability to understand biological processes at an unprecedented resolution. Meanwhile, organizing and utilizing this tremendous amount of data has become a big challenge. High-throughput technology provides us a snapshot of all underlying biological activities, but this kind of extremely high-dimensional data is hard to interpret. Due to the curse of dimensionality, the measurements are sparse and far from enough to shape the actual manifold in the high-dimensional space. On the other hand, the measurements may contain structured noise, such as technical or nuisance biological variation, which can interfere with downstream interpretation. Generative modeling is a powerful tool to make sense of the data and generate compact representations summarizing the embedded biological information. This thesis introduces three generative models that help amplify biological signals buried in noisy bulk and single-cell RNA-seq data. In Chapter 2, we propose a semi-supervised deconvolution framework called PLIER, which can identify regulations in cell-type proportions and specific pathways that control gene expression. PLIER has inspired the development of MultiPLIER and has been used to infer context-specific genotype effects in the brain. In Chapter 3, we construct a supervised transformation named DataRemix to normalize bulk gene expression profiles in order to maximize the biological findings with respect to a variety of downstream tasks. By reweighing the contribution of hidden factors, we are able to reveal the hidden biological signals without any external dataset-specific knowledge. We apply DataRemix to the ROSMAP dataset and report the first replicable trans-eQTL effect in human brain. In Chapter 4, we focus on scRNA-seq and introduce NIFA, an unsupervised decomposition framework that combines the desired properties of PCA, ICA and NMF. It simultaneously models uni- and multi-modal factors, isolating discrete cell-type identity and continuous pathway-level variations into separate components. The work presented in Chapter 2 has been published as a journal article. The work in Chapter 3 and Chapter 4 is under submission and is available as preprints on bioRxiv.

    Representing and extracting knowledge from single cell data

    Single-cell analysis is currently one of the most high-resolution techniques to study biology. The large complex datasets that have been generated have spurred numerous developments in computational biology, in particular the use of advanced statistics and machine learning. This review attempts to explain the deeper theoretical concepts that underpin current state-of-the-art analysis methods. Single-cell analysis is covered from cell, through instruments, to current and upcoming models. A minimum of mathematics and statistics has been used, but the reader is assumed to either have basic knowledge of single-cell analysis workflows, or have a solid knowledge of statistics. The aim of this review is to spread concepts which are not yet in common use, especially from topology and generative processes, and how new statistical models can be developed to capture more of biology. This opens epistemological questions regarding our ontology and models, and some pointers will be given to how natural language processing (NLP) may help overcome our cognitive limitations for understanding single-cell data

    Integrative computational methodologies on single cell datasets

    High-throughput single cell sequencing has seen exciting developments in recent years. With its high-resolution characterization of genetic, genomic, proteomic, and epigenomic features, single cell data offer more insights on the underlying biological processes than those from bulk sequencing data. The most well developed single cell technologies are single cell RNA-seq (scRNA-seq) for transcriptomics and flow cytometry for proteomics. Many multi-omics single cell sequencing platforms have also emerged recently, such as CITE-seq, which profiles both epitopes and the transcriptome simultaneously. But some well known limitations of single cell data, such as batch variations, shallow sequencing depth, and sparsity, also present many challenges. Many computational approaches built on machine learning and deep learning methods have been proposed to address these challenges. In this dissertation, I present three computational methods for joint analysis of single cell sequencing data, either by multi-omics integration or by joint analysis of multiple datasets. In the first chapter, we focus on single cell proteomics data, specifically, the antibody profiling of CITE-seq and cytometry by time of flight (CyTOF) applied to single cells to measure surface marker abundance. Although CyTOF has high accuracy and was introduced earlier than scRNA-seq, there is a lack of computational methods for cell type classification and annotation for these data. We propose a novel automated cell type annotation tool that incorporates CITE-seq data from the same tissue, publicly available annotated scRNA-seq data, and prior knowledge of surface markers in the literature. Our new method, called automated single cell proteomics data annotation approach (ProtAnno), is based on non-negative matrix factorization. We demonstrate the annotation accuracy and robustness of ProtAnno through extensive applications, especially for peripheral blood mononuclear cells (PBMC). The second chapter introduces an integrative method that improves the decomposition of bulk sequencing data into cell type proportions by harmonizing scRNA-seq data across multiple tissues or multiple studies. As a Bayesian model, our method, called tranSig, is able to construct a more reliable signature matrix for decomposition by borrowing information from other tissues and/or studies. Our method can be considered an add-on step in cell type decomposition. It can better derive the signature gene matrix and better characterize the biological heterogeneity from bulk sequencing datasets. Finally, in the last chapter, we propose a method to jointly analyze scRNA-seq data with summary statistics from genome-wide association studies (GWAS). Our method generates a set of SNP (single nucleotide polymorphism)-level weight scores for each cell type or tissue type using an scRNA-seq atlas. These scores are combined with risk allele effect sizes to decompose the polygenic risk score (PRS) into cell types or tissue types. We show through enrichment analysis and phenome-wide association study (PheWAS) that the decomposed PRSs can better explain the biological mechanisms of genetic effects on complex traits mediated through transcription regulation and the differences across cell types and tissues.
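    The PRS decomposition described in the last chapter of this abstract can be sketched as follows. This is a minimal illustration under assumptions the abstract does not state: the function name, the toy dosages, effect sizes, and cell-type labels are hypothetical, and the SNP-level weights are normalized across cell types so that the components add up to the ordinary PRS. The dissertation's actual weighting scheme may differ.

```python
def decompose_prs(dosages, effects, weights):
    """Split a polygenic risk score into per-cell-type components.

    dosages: risk-allele counts per SNP for one individual
    effects: risk-allele effect size per SNP (e.g. from GWAS summary statistics)
    weights: one dict per SNP mapping cell type -> non-negative weight
             (hypothetical SNP-level scores; normalized per SNP below)
    """
    cell_types = sorted({c for w in weights for c in w})
    parts = {c: 0.0 for c in cell_types}
    for g, beta, w in zip(dosages, effects, weights):
        total = sum(w.values())
        for c in cell_types:
            # Assumption: each SNP's contribution is shared in proportion
            # to its weights, so shares across cell types sum to 1.
            share = w.get(c, 0.0) / total if total else 0.0
            parts[c] += share * beta * g
    return parts


# Hypothetical toy inputs for three SNPs and two cell types.
dosages = [2, 1, 0]
effects = [0.5, -0.2, 0.3]
weights = [
    {"neuron": 3.0, "astrocyte": 1.0},
    {"neuron": 1.0, "astrocyte": 1.0},
    {"neuron": 0.0, "astrocyte": 2.0},
]

parts = decompose_prs(dosages, effects, weights)
total_prs = sum(b * g for b, g in zip(effects, dosages))
```

    Because the shares are normalized per SNP, the cell-type components sum to the standard PRS, which makes the split easy to sanity-check.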