Location of Repository

Comparing variant calling algorithms for target-exon sequencing in a large sample

By Yancy Lo, Hyun M Kang, Matthew R Nelson, Mohammad I Othman, Stephanie L Chissoe, Margaret G Ehm, Gonçalo R Abecasis and Sebastian Zöllner


Abstract Background Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing. Results Using these calling methods, we detected over 27,500 variants at the targeted exons; >57% were singletons. The singletons identified by individual-based analyses were of the highest quality. However, individual-based analyses generated more missing genotypes (4.72%) than population-based (0.47%) and LD-aware (0.17%) analyses. Moreover, individual-based genotypes were the least concordant with array-based genotypes and replicates. Population-based genotypes were less concordant than genotypes from LD-aware analyses with extended haplotypes. We reanalyzed the same dataset with a second set of callers and showed again that the individual-based caller identified more high-quality singletons than the population-based caller. We also replicated this result in a second dataset of 57 genes sequenced at 127.5x in 3,124 individuals. Conclusions We recommend population-based analyses for high quality variant calls with few missing genotypes. With extended haplotypes, LD-aware methods generate the most accurate and complete genotypes. In addition, individual-based analyses should complement the above methods to obtain the most singleton variants

Year: 2015
DOI identifier: 10.1186/s12859-015-0489-0
OAI identifier: oai:deepblue.lib.umich.edu:2027.42/110906

Suggested articles


To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.