45 research outputs found

    Model-based quality assessment and base-calling for second-generation sequencing data

    Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, and is capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1,200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T, between 30 and 100 characters long), which result from base-calling, the complex processing of noisy continuous fluorescence intensity measurements. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this paper we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates, readily usable in quality assessment tools, while significantly improving base-calling performance.

    Overcoming bias and systematic errors in next generation sequencing data

    Considerable time and effort have been spent developing analysis and quality assessment methods to allow the use of microarrays in a clinical setting. As is the case for microarrays and other high-throughput technologies, data from new high-throughput sequencing technologies are subject to technological and biological biases and systematic errors that can impact downstream analyses. Only when these issues can be readily identified and reliably adjusted for will clinical applications of these new technologies be feasible. Although much work remains to be done in this area, we describe consistently observed biases that should be taken into account when analyzing high-throughput sequencing data. In this article, we review current knowledge about these biases, discuss their impact on analysis results, and propose solutions.

    DNA Methylation Patterns in Cord Blood of Neonates Across Gestational Age: Association With Cell-Type Proportions

    Background: A statistical methodology is available to estimate the proportion of cell types (cellular heterogeneity) in adult whole blood specimens used in epigenome-wide association studies (EWAS). However, there is no methodology to estimate the proportion of cell types in umbilical cord blood (also a heterogeneous tissue) used in EWAS. Objectives: The objectives of this study were to determine whether differences in DNA methylation (DNAm) patterns in umbilical cord blood are the result of blood cell type proportion changes that typically occur across gestational age and to demonstrate the effect of cell type proportion confounding by comparing preterm infants exposed and not exposed to antenatal steroids. Methods: We obtained DNAm profiles of cord blood using the Illumina HumanMethylation27k BeadChip array for 385 neonates from the Boston Birth Cohort. We estimated cell type proportions for six cell types using the deconvolution method developed by Houseman et al. (2012). Results: The cell type proportion estimates segregated into two groups that were significantly different by gestational age, indicating that gestational age was associated with cell type proportion. Among infants exposed to antenatal steroids, the number of differentially methylated CpGs dropped from 127 to 1 after controlling for cell type proportion. Discussion: EWAS utilizing cord blood are confounded by cell type proportion. Careful study design, including correction for cell type proportion, and careful interpretation of EWAS results using cord blood are critical.

    Epiviz Web Components: reusable and extensible component library to visualize functional genomic datasets [version 1; referees: 1 approved, 2 approved with reservations]

    Interactive and integrative data visualization tools and libraries are integral to exploration and analysis of genomic data. Web-based genome browsers allow integrative data exploration of a large number of data sets for a specific region in the genome. Currently available web-based genome browsers are developed for specific use cases and datasets, so integrating and extending the visualizations and underlying libraries from these tools is challenging. Genomic data visualization and software libraries that enable bioinformatic researchers and developers to implement customized genomic data viewers and data analyses for their applications are much needed. Using recent advances in core web platform APIs and technologies, including Web Components, we developed the Epiviz Component Library, a reusable and extensible data visualization library and application framework for genomic data. Epiviz Components can be integrated with most JavaScript libraries and frameworks designed for HTML. To demonstrate the ease of integration with other frameworks, we developed the R/Bioconductor package epivizrChart, which provides interactive, shareable, and reproducible visualizations of genomic data objects in R and Shiny and can also create standalone HTML documents. The component library is modular by design, reusable, and natively extensible, and therefore simplifies the process of managing and developing bioinformatic applications.

    The partitioned LASSO-patternsearch algorithm with application to gene expression data

    In systems biology, the task of reverse engineering gene pathways from data has been limited not just by the curse of dimensionality (the interaction space is huge) but also by systematic error in the data. The gene expression barcode reduces spurious association driven by batch effects and probe effects. The binary nature of the resulting expression calls lends itself perfectly to modern regularization approaches that thrive in high-dimensional settings. The Partitioned LASSO-Patternsearch algorithm is proposed to identify patterns of multiple dichotomous risk factors for outcomes of interest in genomic studies. A partitioning scheme is used to identify promising patterns by solving many LASSO-Patternsearch subproblems in parallel. All variables that survive this stage proceed to an aggregation stage where the most significant patterns are identified by solving a reduced LASSO-Patternsearch problem in just these variables. This approach was applied to genetic data sets with expression levels dichotomized by the gene expression barcode. Most of the genes and second-order interactions thus selected are known to be related to the outcomes. We demonstrate with simulations and data analyses that the proposed method not only selects variables and patterns more accurately, but also provides smaller models with better prediction accuracy, in comparison to several alternative methodologies. https://doi.org/10.1186/1471-2105-13-9

    Large hypomethylated blocks as a universal defining epigenetic alteration in human solid tumors

    Background: One of the most provocative recent observations in cancer epigenetics is the discovery of large hypomethylated blocks, including single copy genes, in colorectal cancer, that correspond in location to heterochromatic LOCKs (large organized chromatin lysine-modifications) and LADs (lamin-associated domains). Methods: Here we performed a comprehensive genome-scale analysis of 10 breast, 28 colon, nine lung, 38 thyroid, and 18 pancreas cancers, and five pancreas neuroendocrine tumors, together with matched normal tissue from most of these cases and 51 premalignant lesions. We used a new statistical approach that allows the identification of large hypomethylated blocks on the Illumina HumanMethylation450 BeadChip platform. Results: We find that hypomethylated blocks are a universal feature of common solid human cancer, and that they occur at the earliest stage of premalignant tumors and progress through clinical stages of thyroid and colon cancer development. We also find that the disrupted CpG islands widely reported previously, including hypermethylated island bodies and hypomethylated shores, are enriched in hypomethylated blocks, with flattening of the methylation signal within and flanking the islands. Finally, we find that genes showing higher between-individual gene expression variability are enriched within these hypomethylated blocks. Conclusion: Thus hypomethylated blocks appear to be a universal defining epigenetic alteration in human cancer, at least for common solid tumors. Electronic supplementary material: The online version of this article (doi:10.1186/s13073-014-0061-y) contains supplementary material, which is available to authorized users.

    A framework for assessing 16S rRNA marker-gene survey data analysis methods using mixtures

    There are a variety of bioinformatic pipelines and downstream analysis methods for analyzing 16S rRNA marker-gene surveys. However, appropriate assessment datasets and metrics are needed, as there is limited guidance for deciding between available analysis methods. Mixtures of environmental samples are useful for assessing analysis methods because one can evaluate methods against expected values calculated from unmixed sample measurements and the mixture design. Previous studies have used mixtures of environmental samples to assess other sequencing methods such as RNAseq, but no studies have used mixtures of environmental samples to assess 16S rRNA sequencing. We developed a framework for assessing 16S rRNA sequencing analysis methods which utilizes a novel two-sample titration mixture dataset and metrics to evaluate qualitative and quantitative characteristics of count tables. Our qualitative assessment evaluates feature presence/absence, exploiting features only present in unmixed samples or titrations by testing if random sampling can account for their observed relative abundance. Our quantitative assessment evaluates feature relative and differential abundance by comparing observed and expected values. We demonstrated the framework by evaluating count tables generated with three commonly used bioinformatic pipelines: (i) DADA2, a sequence inference method, (ii) Mothur, a de novo clustering method, and (iii) QIIME, an open-reference clustering method. The qualitative assessment results indicated that the majority of Mothur and QIIME features only present in unmixed samples or titrations were accounted for by random sampling alone, but this was not the case for DADA2 features. Combined with count table sparsity (proportion of zero-valued cells in a count table), these results indicate DADA2 has a higher false-negative rate whereas Mothur and QIIME have higher false-positive rates. The quantitative assessment results indicated the observed relative abundance and differential abundance values were consistent with expected values for all three pipelines. We developed a novel framework for assessing 16S rRNA marker-gene survey methods and demonstrated the framework by evaluating count tables generated with three bioinformatic pipelines. This framework is a valuable community resource for assessing 16S rRNA marker-gene survey bioinformatic methods and will help scientists identify appropriate analysis methods for their marker-gene surveys. https://doi.org/10.1186/s40168-020-00812-
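    The idea of computing expected values from unmixed sample measurements and the mixture design can be illustrated with a minimal sketch. It assumes a simple linear mixing model, in which the expected relative abundance of a feature is the proportion-weighted average of its relative abundance in the two unmixed samples; the abstract does not state the exact formula, and the function names, feature names, and numbers below are hypothetical.

```python
def expected_relative_abundance(unmixed_a, unmixed_b, proportion_a):
    """Expected relative abundance of each feature in a two-sample mixture.

    Assumes linear mixing (an illustrative assumption, not taken from the
    abstract): expected = p * A + (1 - p) * B.

    unmixed_a, unmixed_b: dicts mapping feature name -> relative abundance
        in each unmixed sample (hypothetical inputs).
    proportion_a: fraction of the mixture contributed by sample A, in [0, 1].
    """
    if not 0.0 <= proportion_a <= 1.0:
        raise ValueError("proportion_a must be between 0 and 1")
    # A feature missing from one unmixed sample is treated as abundance 0 there.
    features = sorted(set(unmixed_a) | set(unmixed_b))
    return {
        f: proportion_a * unmixed_a.get(f, 0.0)
        + (1.0 - proportion_a) * unmixed_b.get(f, 0.0)
        for f in features
    }


def relative_error(observed, expected):
    """Per-feature (observed - expected) / expected; None when expected is 0."""
    return {
        f: (observed.get(f, 0.0) - e) / e if e > 0 else None
        for f, e in expected.items()
    }


# Hypothetical example: a mixture made of 25% sample A and 75% sample B.
sample_a = {"feature_1": 0.4, "feature_2": 0.6}
sample_b = {"feature_2": 1.0}
expected = expected_relative_abundance(sample_a, sample_b, 0.25)
observed = {"feature_1": 0.12, "feature_2": 0.88}
errors = relative_error(observed, expected)
```

    Under this assumed model, a feature seen only in one unmixed sample (such as `feature_1` here) still has a nonzero expected abundance in the mixture, which is the kind of feature the qualitative presence/absence assessment focuses on.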

    Multivariable association discovery in population-scale meta-omics studies

    It is challenging to associate features such as human health outcomes, diet, environmental conditions, or other metadata with microbial community measurements, due in part to their quantitative properties. Microbiome multi-omics are typically noisy, sparse (zero-inflated), high-dimensional, extremely non-normal, and often in the form of count or compositional measurements. Here we introduce an optimized combination of novel and established methodology to assess multivariable association of microbial community features with complex metadata in population-scale observational studies. Our approach, MaAsLin 2 (Microbiome Multivariable Associations with Linear Models), uses generalized linear and mixed models to accommodate a wide variety of modern epidemiological studies, including cross-sectional and longitudinal designs, as well as a variety of data types (e.g., counts and relative abundances) with or without covariates and repeated measurements. To construct this method, we conducted a large-scale evaluation of a broad range of scenarios under which straightforward identification of meta-omics associations can be challenging. These simulation studies reveal that MaAsLin 2's linear model preserves statistical power in the presence of repeated measures and multiple covariates, while accounting for the nuances of meta-omics features and controlling false discovery. We also applied MaAsLin 2 to a microbial multi-omics dataset from the Integrative Human Microbiome (HMP2) project which, in addition to reproducing established results, revealed a unique, integrated landscape of inflammatory bowel diseases (IBD) across multiple time points and omics profiles.