157 research outputs found

    Detecting and Correcting Contamination in Genetic Data.

    While technological innovation has dramatically increased the amount and variety of genomic data available to geneticists, no assay is perfect, and both human error and technical artifacts can lead to erroneous data. A proper analysis pipeline must detect errors and, if possible, correct them. One common source of errors in genetic data is sample-to-sample contamination. This dissertation identifies methods to address contamination in the most common types of genetic studies. Chapter 2 focuses on methods for detecting and quantifying contamination in both array-based and next-generation sequencing (NGS) genotype data. For the array-based data, we use the observed intensities from the genotyping instruments to quantify contamination with two distinct methods: 1) a regression-based model using intensities and population allele frequencies and 2) a multivariate normal mixture model that looks at the clustering of intensities. For NGS data, we model the reads using a mixture model to determine the proportion of reads from the true sample and the contaminating sample. Chapter 3 outlines a method to make accurate genotype calls with contaminated NGS data. Given an estimated level of contamination, we propose a likelihood that can be maximized to call genotypes and estimate allele frequencies for samples with no previous genotype data. We investigate the method using data from two common sequencing strategies: 1) low-pass (2-4x depth) genome-wide sequencing and 2) high-depth (50-100x depth) exome sequencing. Chapter 4 looks at contamination in the context of RNA sequencing (RNA-Seq) data. While the technology to generate RNA-Seq data is similar to exome sequencing, the difference in expression between the contaminating and true sample makes it more difficult to accurately estimate the contamination proportion. We propose methods to improve the quality of these estimates.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120783/1/mflick_1.pd
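The read-level mixture idea in this abstract can be illustrated with a minimal sketch. It is not the dissertation's model: it assumes the true sample's genotypes are already known, that the contaminating reads behave like a random draw from the population, and that each site contributes an independent binomial count. The function names, the symmetric `error` rate, and the grid search over `alpha` are all hypothetical choices made for illustration.

```python
import math


def alt_read_probability(alpha, genotype, pop_freq, error=0.01):
    """Probability that one read shows the alternate allele.

    Assumed model: a fraction `alpha` of reads comes from a contaminating
    individual drawn from the population (alternate-allele frequency
    `pop_freq`); the rest come from the true sample, whose `genotype` is the
    known number of alternate alleles (0, 1 or 2). `error` is a symmetric
    per-read error rate.
    """
    p = (1.0 - alpha) * genotype / 2.0 + alpha * pop_freq
    return p * (1.0 - error) + (1.0 - p) * error


def log_likelihood(alpha, sites, error=0.01):
    """Binomial log-likelihood (constant term dropped) summed over sites.

    Each site is a tuple (genotype, pop_freq, alt_reads, total_reads).
    """
    total = 0.0
    for genotype, pop_freq, alt_reads, total_reads in sites:
        p = alt_read_probability(alpha, genotype, pop_freq, error)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total += alt_reads * math.log(p)
        total += (total_reads - alt_reads) * math.log(1.0 - p)
    return total


def estimate_contamination(sites, error=0.01, steps=1000):
    """Grid-search the contamination proportion alpha over [0, 0.5]."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: log_likelihood(a, sites, error))
```

Calling `estimate_contamination(sites)` returns the grid value of `alpha` with the highest likelihood. A grid search is used only because it is easy to read; a one-dimensional optimizer would serve equally well.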

    QuickDeconvolution: fast and scalable deconvolution of linked-reads sequencing data

    Linked-reads technologies, such as the 10X Chromium system, use microfluidics to tag multiple short reads coming from the same long (50-200 kbp) fragment with a small sequence, called a barcode. They are cheap and easy to prepare. The fact that reads with the same barcode come from the same fragment of the genome is extremely rich in information and can be used by a wide range of software. However, the same barcode may be used several times for several different fragments, complicating the analyses. Here we present QuickDeconvolution (QD), a new software tool capable of deconvolving a set of reads sharing a barcode, i.e. separating the reads that come from different fragments. QD takes only the sequencing data as input, without the need for a reference genome. We show on made-up examples that QuickDeconvolution is more precise and faster than existing software, especially with many threads. More importantly, it is more scalable and therefore capable of deconvolving datasets that were inaccessible to previous software. We demonstrate here the first example in the literature of a successfully deconvolved animal genome, a Drosophila melanogaster dataset of 33 Gbp.

    Enhancing preprocessing and clustering of single-cell RNA sequencing data

    Single-cell RNA sequencing (scRNA-seq) is the leading technique for characterizing cellular heterogeneity in biological samples. Various scRNA-seq protocols have been developed that can measure the transcriptome from thousands of cells in a single experiment. With these methods readily available, the ability to transform raw data into biological understanding of complex systems is now a rate-limiting step. In this dissertation, I introduce novel computational software and tools which enhance preprocessing and clustering of scRNA-seq data and evaluate their performance compared to existing methods. First, I present scruff, an R/Bioconductor package that preprocesses data generated from scRNA-seq protocols including CEL-Seq or CEL-Seq2 and reports comprehensive data quality metrics and visualizations. scruff rapidly demultiplexes, aligns, and counts the reads mapped to genomic features with deduplication of unique molecular identifier (UMI) tags and provides novel and extensive functions to visualize both pre- and post-alignment data quality metrics for cells from multiple experiments. Second, I present Celda, a novel Bayesian hierarchical model that can perform simultaneous co-clustering of genes into transcriptional modules and cells into subpopulations for scRNA-seq data. Celda identified novel cell subpopulations in a publicly available peripheral blood mononuclear cell (PBMC) dataset and outperformed a PCA-based approach for gene clustering on simulated data. Third, I extend the application of Celda by developing a multimodal clustering method that utilizes both mRNA and protein expression information generated from single-cell sequencing datasets with multiple modalities, and demonstrate that Celda multimodal clustering captured meaningful biological patterns which are missed by transcriptome- or protein-only clustering methods. 
Collectively, this work addresses limitations present in the computational analyses of scRNA-seq data by providing novel methods and solutions that enhance scRNA-seq data preprocessing and clustering.

    Statistical Methods for Human Microbiome Data Analysis

    The human microbiome is the totality of the microbes, their genetic elements and the interactions they have with surrounding environments throughout the human body. Studies have implicated the human microbiome in health and disease. Two central themes of human microbiome studies are to identify potential factors influencing the microbiome composition, and to define the relationship between microbiome features and biological or clinical outcomes. With the development of next-generation sequencing technologies, the human microbiome composition can be interrogated using high-throughput DNA sequencing. One strategy sequences the bacterial 16S ribosomal RNA gene for species identification. These 16S sequences are usually clustered into Operational Taxonomic Units (OTUs). Analysis of such OTU data raises several important statistical challenges, including taking into account the phylogenetic relationship among OTUs and modeling high-dimensional overdispersed count data. This dissertation presents three statistical methods developed specifically for 16S data analysis, centered on these two themes. To test the association between overall microbiome composition and a covariate or an outcome, a testing procedure based on a generalized UniFrac distance was developed. The generalized UniFrac distance corrects the undue weight that classic UniFrac distances place on either highly abundant or rare lineages, and was shown to be more powerful than the classic UniFrac distances. Under the framework of canonical correlation analysis (CCA), a structure-constrained sparse CCA was proposed to select the OTUs and their correlated covariates. A phylogenetic structure-constrained penalty function was imposed to induce smoothness on the linear coefficients according to the OTU phylogenetic relationship. Structure-constrained sparse CCA performed much better than sparse CCA in selecting relevant OTUs.
Finally, a sparse Dirichlet-multinomial regression (SDMR) model was developed to link the microbiome composition to environmental covariates and to select the most important covariates and their affected OTUs. SDMR accounts for the overdispersion of OTU counts and uses a sparse group L1 penalty function to facilitate selection of covariates and OTUs simultaneously. These methods were illustrated using simulations as well as a real human gut microbiome data set from a study of dietary effects on gut microbiome composition.
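To make the role of the generalized UniFrac distance concrete, the sketch below computes it for two samples. The abstract does not state the formula, so this assumes the commonly cited definition, in which each branch's relative difference is weighted by its length times the summed branch proportions raised to a power `alpha`. The function name and the flat-list inputs are hypothetical; a real analysis would derive the branch proportions from a phylogenetic tree.

```python
def generalized_unifrac(branch_lengths, props_a, props_b, alpha=0.5):
    """Generalized UniFrac distance between two samples (assumed definition).

    branch_lengths[i] is the length of branch i of the phylogenetic tree;
    props_a[i] and props_b[i] are the proportions of each sample's sequences
    that descend from branch i. alpha in [0, 1] controls how strongly
    abundant lineages are weighted: alpha=1 gives the normalized weighted
    UniFrac, while smaller values give rarer lineages more influence.
    """
    numerator = 0.0
    denominator = 0.0
    for length, pa, pb in zip(branch_lengths, props_a, props_b):
        total = pa + pb
        if total == 0.0:
            continue  # a branch absent from both samples contributes nothing
        weight = length * total ** alpha
        numerator += weight * abs(pa - pb) / total
        denominator += weight
    return numerator / denominator if denominator > 0.0 else 0.0
```

Intermediate values such as `alpha=0.5` are the compromise the abstract alludes to: neither the most abundant nor the rarest lineages dominate the distance.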

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations.
    Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis

    Computational approaches for single-cell omics and multi-omics data

    Single-cell omics and multi-omics technologies have enabled the study of cellular heterogeneity with unprecedented resolution and the discovery of new cell types. Identifying heterogeneous cell types, both existing and novel, relies on efficient computational approaches, especially cluster analysis. Additionally, gene regulatory network analysis and various integrative approaches are needed to combine data across studies and different multi-omics layers. This thesis comprehensively compared Bayesian clustering models for single-cell RNA sequencing (scRNA-seq) data, and selected integrative approaches were used to study the cell-type-specific gene regulation of the uterus. Additionally, single-cell multi-omics data integration approaches for cell heterogeneity analysis were investigated. Article I investigated analytical approaches for cluster analysis in scRNA-seq data, particularly latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP) models. The comparison of LDA and HDP with existing state-of-the-art methods revealed that topic modeling-based models can be useful in scRNA-seq cluster analysis. Evaluation of the cluster qualities for LDA and HDP with intrinsic and extrinsic cluster quality metrics indicated that the clustering performance of these methods is dataset dependent. Article II and Article III focused on cell-type-specific integrative analysis of uterine or decidual stromal (dS) and natural killer (dNK) cells, which are important for successful pregnancy. Article II integrated the existing preeclampsia RNA-seq studies of the decidua with recent scRNA-seq datasets in order to investigate cell-type-specific contributions of early onset preeclampsia (EOP) and late onset preeclampsia (LOP). It was discovered that the dS marker genes were enriched for LOP downregulated genes and the dNK marker genes were enriched for upregulated EOP genes.
Article III presented a gene regulatory network analysis for the subpopulations of dS and dNK cells. This study identified novel subpopulation-specific transcription factors that promote decidualization of stromal cells and dNK-mediated maternal immunotolerance. In Article IV, different strategies and methodological frameworks for data integration in single-cell multi-omics data analysis were reviewed in detail. Data integration methods were grouped into early, late and intermediate data integration strategies. The specific stage and order of data integration can have a substantial effect on the results of the integrative analysis. The central details of the approaches were presented, and potential future directions were discussed. Computational methods for analyzing single-cell sequencing and multi-omics results. Single-cell sequencing technologies make it possible to study cellular heterogeneity at unprecedented resolution and to discover new cell types. Grouping, that is, cluster analysis, plays a central role in identifying cell types. Combining gene regulatory networks and different layers of molecular data is also central to the analysis. The dissertation compares Bayesian clustering methods and combines data collected with different methods in a cell-type-specific gene regulation analysis of the uterus. In addition, integration methods for single-cell data are examined comprehensively. Publication I focuses on studying analytical methods, particularly models based on latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP), in cluster analysis of single-cell data. A comprehensive comparison of these two models with existing methods revealed that topic modeling-based methods can be useful in cluster analysis of single-cell data. The performance of the methods also depended on the characteristics of each dataset analyzed.
Publications II and III focus on cell-type-specific analysis of uterine stromal cells and NK immune cells, which are important for female reproductive health. Article II combined existing results on preeclampsia with the latest single-cell sequencing results and found cell-type-specific effects of early onset preeclampsia (EOP) and late onset preeclampsia (LOP). Expression of the marker genes of differentiated stroma was found to decrease in LOP, and expression of the NK marker genes to increase in EOP. Publication III analyzes subpopulation-specific gene regulatory networks of stromal and NK cells and their transcription factors. The study identified new subpopulation-specific regulators that promote stromal differentiation and NK cell-mediated immunotolerance. Publication IV examines in detail strategies and methods for integrating different single-cell data layers (multi-omics). The integration methods were grouped into early, late and intermediate strategies, and the methods of each approach were presented in more detail. Possible future directions were also discussed.

    Novel Approaches to Studying the Effects of Cis-Regulatory Variants in the Central Nervous System

    For decades, studies of the genetic basis of disease have focused on rare coding mutations that disrupt protein function, leading to the identification of hundreds of genes underlying Mendelian diseases. However, many complex diseases are non-Mendelian, and less than 2% of the genome is coding. It is now clear that non-coding variants contribute to disease susceptibility, but the precise underlying mechanisms are generally unknown. Cis-regulatory elements (CREs) are transcription factor (TF)-bound genomic regions that regulate gene expression, and variants within CREs can therefore modify gene expression. The putative locations of CREs in a variety of cell types have been identified through genome-wide assays of TF binding and epigenomic signatures, providing a starting point for probing the effects of cis-regulatory variants. Unlike coding mutations, which can be interpreted based on the genetic code, the functional consequence of any given cis-regulatory variant is difficult to predict even at the molecular level. Therefore, a major bottleneck lies in interpreting the functional significance of these variants. In the present work, I study the effects of cis-regulatory variants in the central nervous system (CNS), specifically in retina and brain. The retina is composed of well-characterized neuronal cell types and an extensively studied transcriptional network, while the brain is the center of human cognition and a target of devastating neuropsychiatric diseases. First, I take advantage of the genetic diversity between two distantly related mouse strains to describe the relationship between cis-regulatory variants and differences in retinal gene expression. I identify cis- and trans-regulatory effects, as well as parent-of-origin effects. Second, I develop a new technology based on an existing massively parallel reporter assay, CRE-seq, to enable the functional study of long CREs in the CNS in vivo for the first time. 
I demonstrate the ability of this approach to measure tissue-specific cis-regulatory activity in the brain and to pinpoint DNA bases critical for activity. Finally, I conduct a detailed mechanistic study of a non-coding region containing variants associated with both human cognitive performance and bipolar disorder. This last study illustrates the complexities and challenges of establishing the causal role of non-coding variants in disease.