    Comparing variant calling algorithms for target-exon sequencing in a large sample

    Abstract
    Background: Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing.
    Results: Using these calling methods, we detected over 27,500 variants at the targeted exons; >57% were singletons. The singletons identified by individual-based analyses were of the highest quality. However, individual-based analyses generated more missing genotypes (4.72%) than population-based (0.47%) and LD-aware (0.17%) analyses. Moreover, individual-based genotypes were the least concordant with array-based genotypes and replicates. Population-based genotypes were less concordant than genotypes from LD-aware analyses with extended haplotypes. We reanalyzed the same dataset with a second set of callers and showed again that the individual-based caller identified more high-quality singletons than the population-based caller. We also replicated this result in a second dataset of 57 genes sequenced at 127.5x in 3,124 individuals.
    Conclusions: We recommend population-based analyses for high quality variant calls with few missing genotypes. With extended haplotypes, LD-aware methods generate the most accurate and complete genotypes. In addition, individual-based analyses should complement the above methods to obtain the most singleton variants.
    http://deepblue.lib.umich.edu/bitstream/2027.42/110906/1/12859_2015_Article_489.pd

    Comparing variant calling algorithms for target-exon sequencing in a large sample

    http://deepblue.lib.umich.edu/bitstream/2027.42/134735/1/12859_2015_Article_489.pd

    Statistical Methods, Analyses and Applications for Next-Generation Sequencing Studies.

    Current genetics studies rely heavily on next-generation sequencing (NGS) techniques. This dissertation addresses methodological developments and statistical strategies for analyzing large amounts of NGS data efficiently and accurately, in order to understand the genetic contributions to diseases. In chapter 2, we evaluated the benefits of different variant calling strategies by performing a comparative analysis of calling methods on large-scale exonic sequencing datasets. We found that individual-based analyses identified the most high-quality singletons, but had lower genotype accuracy at common variants than population-based and LD-aware analyses. Therefore, we recommend population-based analyses for high-quality variant calls with few missing genotypes, complemented by individual-based analyses to obtain the most singleton variants. In chapters 3 and 4, we addressed the issue of overlapping read pairs in NGS studies arising from short fragments. In chapter 3, we proposed novel models to separately estimate machine and fragment errors of an NGS experiment from overlapping read pairs. Using a Markov chain Monte Carlo algorithm, our models suggested that machine and fragment errors were largely predicted by the reported quality scores of the overlapping bases and were uniform across individual samples from the same experiment. In chapter 4, we proposed an algorithm, RESCORE, to resolve the fragment dependence while retaining machine error estimates in overlapping reads. When compared to soft-clipping the overlapping regions, RESCORE increased the recalibrated base quality scores for the majority of overlapping bases, leading to a decrease in the estimated false positive rate of novel variant discovery. In chapter 5, we presented an application of whole-genome sequencing for understanding the evolutionary history of uropathogenic Escherichia coli (UPEC). We sequenced 14 UPEC and 5 commensals at >190x, and found a deep split between UPEC and commensal E. coli. 
We observed high between-strain diversity, which suggests multiple origins of pathogenicity. We detected no selective advantage of virulence genes over other genomic regions. These results suggest that UPEC acquired uropathogenicity a long time ago and used it opportunistically to cause extraintestinal infections. In summary, this dissertation presented practical strategies for NGS studies that will contribute to further genetic advances.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/116761/1/yancylo_1.pd