
    Identifying Copy Number Variations based on Next Generation Sequencing Data by a Mixture of Poisson Model

    Next generation sequencing (NGS) technologies have profoundly impacted biological research and are becoming increasingly popular due to their cost effectiveness and speed. NGS can be used to identify DNA structural variants, such as copy number variations (CNVs), which have been associated with diseases such as HIV, type 2 diabetes, and cancer.

First approaches to detect CNVs in NGS data exist; most of them detect a CNV through a significant difference in read counts between neighboring windows along the chromosome. However, these methods suffer from systematic variations of the underlying read-count distributions along the chromosome due to biological and technical noise. In contrast to these global methods, we locally model the read-count distribution by a mixture of Poissons, which allows us to incorporate a linear dependence between copy numbers and read counts. Model selection is performed in a Bayesian framework by maximizing the posterior through an EM algorithm. We define a CNV call as a deviation of the Poisson mixture parameters from the null hypothesis represented by the prior, which models a constant copy number across the samples. A CNV call therefore requires sufficient information in the data to push the model away from the null hypothesis given by the prior.

We test our approach on the HapMap cohort, where we rediscover previously reported CNVs, validating the method. We then apply it to a tumor genome data set, where we considerably increase detection while reducing false discoveries.

    A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection

    Background: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach in the application of detecting copy number variants (CNVs) using a real example.

    Results: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs, and named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we conclude that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.

    Conclusions: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
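The randomized-estimator idea, solving the regression on a uniform subsample of windows instead of the whole genome, can be sketched with a plain Poisson GLM; the NB model adds a dispersion parameter, but the subsampling logic is the same. All names here are illustrative and not taken from R-GENSENG.

```python
import math
import random

def fit_poisson_glm(X, y, n_iter=25):
    """Fit a two-coefficient Poisson log-link GLM by Newton-Raphson.

    X: list of (1.0, covariate) rows; y: list of window read counts.
    The intercept is initialized at log(mean(y)) for stable convergence.
    """
    b = [math.log(sum(y) / len(y)), 0.0]
    for _ in range(n_iter):
        g = [0.0, 0.0]                # score (gradient of the log-likelihood)
        H = [[0.0, 0.0], [0.0, 0.0]]  # Fisher information X^T diag(mu) X
        for (x0, x1), yi in zip(X, y):
            mu = math.exp(b[0] * x0 + b[1] * x1)
            r = yi - mu
            g[0] += r * x0
            g[1] += r * x1
            H[0][0] += mu * x0 * x0
            H[0][1] += mu * x0 * x1
            H[1][1] += mu * x1 * x1
        H[1][0] = H[0][1]
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        # Newton step: b <- b + H^{-1} g (2x2 solve in closed form)
        b[0] += (H[1][1] * g[0] - H[0][1] * g[1]) / det
        b[1] += (H[0][0] * g[1] - H[1][0] * g[0]) / det
    return b

def randomized_glm_fit(X, y, m, seed=0):
    """Randomized estimator: fit the same GLM on a uniform subsample of m windows."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(y)), m)
    return fit_poisson_glm([X[i] for i in idx], [y[i] for i in idx])
```

The full fit costs one pass over all n windows per Newton iteration; the randomized fit replaces n by the subsample size m, which is where the speed-up comes from when m << n.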

    Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing

    We propose a flexible change-point model for inhomogeneous Poisson processes, which arise naturally from next-generation DNA sequencing, and derive score and generalized likelihood statistics for shifts in intensity functions. We construct a modified Bayesian information criterion (mBIC) to guide model selection, and point-wise approximate Bayesian confidence intervals for assessing the confidence in the segmentation. The model is applied to DNA copy number profiling with sequencing data and evaluated on simulated spike-in and real data sets. Published in the Annals of Applied Statistics (http://dx.doi.org/10.1214/11-AOAS517, http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
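The generalized likelihood statistic for a shift in Poisson intensity can be illustrated in its simplest form: one candidate change-point in a sequence of window counts, with a homogeneous rate on each side. This sketch uses our own names and omits the mBIC model-selection and confidence-interval machinery described above.

```python
import math

def poisson_seg_loglik(s, m, lam):
    """Poisson log-likelihood of m windows with count-sum s at rate lam,
    dropping the lgamma(count+1) terms, which cancel in likelihood ratios."""
    return s * math.log(lam) - lam * m if lam > 0 else 0.0

def best_changepoint(counts):
    """Generalized likelihood ratio scan for a single shift in Poisson rate.

    Returns (k, stat): the split index maximizing 2*(logL1 - logL0), where
    L1 fits separate rates to counts[:k] and counts[k:] and L0 fits one rate.
    """
    n, total = len(counts), sum(counts)
    ll0 = poisson_seg_loglik(total, n, total / n)
    best_k, best_stat = None, 0.0
    prefix = 0
    for k in range(1, n):
        prefix += counts[k - 1]
        ll1 = (poisson_seg_loglik(prefix, k, prefix / k)
               + poisson_seg_loglik(total - prefix, n - k, (total - prefix) / (n - k)))
        stat = 2.0 * (ll1 - ll0)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat
```

Running the prefix sum makes the scan linear in the number of windows; a segmentation procedure would apply this recursively and penalize each accepted change-point with a criterion such as mBIC.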

    Rapid detection of copy number variations and point mutations in BRCA1/2 genes using a single workflow by ion semiconductor sequencing pipeline

    Molecular analysis of the BRCA1 (MIM #604370) and BRCA2 (MIM #600185) genes is essential for familial breast and ovarian cancer prevention and treatment, so an efficient, rapid, cost-effective and accurate strategy for the detection of pathogenic variants is crucial. Mutation detection in BRCA1/2 includes screening for single nucleotide variants (SNVs), small insertions or deletions (indels), and copy number variations (CNVs). Sanger sequencing cannot identify CNVs, so Multiplex Ligation-dependent Probe Amplification (MLPA) or Multiplex Amplicon Quantification (MAQ) is used to complete the BRCA1/2 analysis. The rapid evolution of Next Generation Sequencing (NGS) technologies allows the search for point mutations and CNVs with a single platform and workflow. In this study we tested the ability of NGS technology to simultaneously detect point mutations and CNVs in BRCA1/2, using the Oncomine™ BRCA Research Assay on the Personal Genome Machine (PGM) platform with Ion Reporter Software for sequencing data analysis (Thermo Fisher Scientific). Comparison of the NGS-CNV, MLPA and MAQ results shows that the NGS approach is the most complete and fastest method for the simultaneous detection of all BRCA mutations, avoiding the usual time-consuming multistep approach in routine diagnostic testing of hereditary breast and ovarian cancers.

    Multi-platform discovery of haplotype-resolved structural variation in human genomes


    The impact of sequencing depth on the inferred taxonomic composition and AMR gene content of metagenomic samples

    Shotgun metagenomics is increasingly used to characterise microbial communities, particularly for the investigation of antimicrobial resistance (AMR) in different animal and environmental contexts. There are many different approaches for inferring the taxonomic composition and AMR gene content of complex community samples from shotgun metagenomic data, but there has been little work establishing the optimum sequencing depth, data processing and analysis methods for these samples. In this study we used shotgun metagenomics and sequencing of cultured isolates from the same samples to address these issues. We sampled three potential environmental AMR gene reservoirs (pig caeca, river sediment, effluent) and sequenced the samples with shotgun metagenomics at high depth (~200 million reads per sample). Alongside this, we cultured single-colony isolates of Enterobacteriaceae from the same samples and used hybrid sequencing (short and long reads) to create high-quality assemblies for comparison to the metagenomic data. To automate data processing, we developed an open-source software pipeline, 'ResPipe'.

    InPhaDel: integrative shotgun and proximity-ligation sequencing to phase deletions with single nucleotide polymorphisms.

    Phasing single nucleotide variants (SNVs) and structural variations into chromosome-wide haplotypes in humans has been challenging, requiring either trio sequencing or restricting phasing to population-based haplotypes. Selvaraj et al. demonstrated that single-individual SNV phasing is possible with proximity-ligation (HiC) sequencing. Here, we demonstrate that HiC can also phase structural variants into phased scaffolds of SNVs. Since HiC data is noisy and SV calling is challenging, we applied a range of supervised classification techniques, including Support Vector Machines and Random Forests, to phase deletions. Our approach was demonstrated on deletion calls and phasings on the NA12878 human genome. We used three NA12878 chromosomes and simulated chromosomes to train model parameters; the remaining NA12878 chromosomes, withheld from training, were used to evaluate phasing accuracy. Random Forest had the highest accuracy, correctly phasing 86% of the deletions with allele-specific read evidence. Allele-specific read evidence was found for 76% of the deletions. HiC provides significant read evidence for accurately phasing 33% of the deletions. In addition, eight of eight top-ranked deletions phased only by HiC were validated using long-range polymerase chain reaction and Sanger sequencing. Thus, deletions from a single individual can be accurately phased using a combination of shotgun and proximity-ligation sequencing. InPhaDel software is available at: http://l337x911.github.io/inphadel/
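The classification step, predicting a deletion's phase from read-evidence features with a Random Forest, can be sketched in miniature with a forest of bootstrapped decision stumps, each grown on one randomly chosen feature. The feature values (e.g., allele-specific read counts) and all names below are illustrative assumptions, not InPhaDel's actual features or code.

```python
import random

def stump_fit(X, y, f):
    """Best threshold rule on feature f minimizing 0/1 training error.
    Returns (f, threshold, flip): predict 1 when (x[f] > threshold) != flip."""
    best = None
    for t in sorted(set(x[f] for x in X)):
        for flip in (False, True):
            err = sum(((x[f] > t) != flip) != bool(yi) for x, yi in zip(X, y))
            if best is None or err < best[0]:
                best = (err, f, t, flip)
    return best[1:]

def forest_fit(X, y, n_trees=30, seed=0):
    """Random forest of stumps: bootstrap the rows, pick one random feature per tree."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of rows
        trees.append(stump_fit([X[i] for i in idx],
                               [y[i] for i in idx],
                               rng.randrange(d)))    # random feature subspace
    return trees

def forest_predict(trees, x):
    """Majority vote over the stumps (ties go to class 1)."""
    votes = sum((x[f] > t) != flip for f, t, flip in trees)
    return 1 if 2 * votes >= len(trees) else 0
```

A production classifier would grow full trees over many features; the point of the sketch is the ensemble structure (bootstrap + random feature selection + majority vote) that makes the method robust to noisy HiC read evidence.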