4 research outputs found

Comparing variant calling algorithms for target-exon sequencing in a large sample

Author: Abecasis Gonçalo R
Chissoe Stephanie L
Ehm Margaret G
Kang Hyun M
Lo Yancy
Nelson Matthew R
Othman Mohammad I
Zöllner Sebastian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Abstract
Background: Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing.
Results: Using these calling methods, we detected over 27,500 variants at the targeted exons; >57% were singletons. The singletons identified by individual-based analyses were of the highest quality. However, individual-based analyses generated more missing genotypes (4.72%) than population-based (0.47%) and LD-aware (0.17%) analyses. Moreover, individual-based genotypes were the least concordant with array-based genotypes and replicates. Population-based genotypes were less concordant than genotypes from LD-aware analyses with extended haplotypes. We reanalyzed the same dataset with a second set of callers and showed again that the individual-based caller identified more high-quality singletons than the population-based caller. We also replicated this result in a second dataset of 57 genes sequenced at 127.5x in 3,124 individuals.
Conclusions: We recommend population-based analyses for high quality variant calls with few missing genotypes. With extended haplotypes, LD-aware methods generate the most accurate and complete genotypes. In addition, individual-based analyses should complement the above methods to obtain the most singleton variants.
http://deepblue.lib.umich.edu/bitstream/2027.42/110906/1/12859_2015_Article_489.pd

Crossref

Springer - Publisher Connector

PubMed Central

Deep Blue Documents at the University of Michigan

Comparing variant calling algorithms for target-exon sequencing in a large sample

Author: Abecasis Gonçalo R
Chissoe Stephanie L
Ehm Margaret G
Kang Hyun M
Lo Yancy
Nelson Matthew R
Othman Mohammad I
Zöllner Sebastian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 03/01/2017

Deep Blue Documents at the University of Michigan

Statistical Methods, Analyses and Applications for Next-Generation Sequencing Studies.

Author: Lo Yan Yancy
Publication date: 01/01/2015

Current genetics studies rely heavily on next-generation sequencing (NGS) techniques. This dissertation addresses methodological developments and statistical strategies to efficiently and accurately analyze the large amounts of NGS data, and thereby to understand the genetic contributions to diseases. In chapter 2, we evaluated the benefits of different variant calling strategies by performing a comparative analysis of calling methods on large-scale exonic sequencing datasets. We found that individual-based analyses identified the most high-quality singletons, but had lower genotype accuracy at common variants than population-based and LD-aware analyses. Therefore, we recommend population-based analyses for high-quality variant calls with few missing genotypes, complemented by individual-based analyses to obtain the most singleton variants. In chapters 3 and 4, we addressed the issue of overlapping read pairs in NGS studies arising from short fragments. In chapter 3, we proposed novel models to separately estimate machine and fragment errors of an NGS experiment from overlapping read pairs. Using a Markov chain Monte Carlo algorithm, our models suggested that machine and fragment errors were largely predicted by the reported quality scores of the overlapping bases and were uniform across individual samples from the same experiment. In chapter 4, we proposed an algorithm, RESCORE, to resolve the fragment dependence while retaining machine error estimates in overlapping reads. When compared to soft-clipping the overlapping regions, RESCORE increased the recalibrated base quality scores for the majority of overlapping bases, leading to a decrease in estimated false positive rate of novel variant discovery. In chapter 5, we presented an application of whole-genome sequencing for understanding the evolutionary history of uropathogenic Escherichia coli (UPEC). We sequenced 14 UPEC and 5 commensals at >190x, and found a deep split between UPEC and commensal E. coli. 
We observed high between-strain diversity, which suggests multiple origins of pathogenicity. We detected no selective advantage of virulence genes over other genomic regions. These results suggest that UPEC acquired uropathogenicity a long time ago and used it opportunistically to cause extraintestinal infections. In summary, this dissertation presented practical strategies for NGS studies that will contribute to further genetic advances.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/116761/1/yancylo_1.pd

Deep Blue Documents at the University of Michigan

Comparing variant calling algorithms for target-exon sequencing in a large sample

Author: A Hodgkinson
A McKenna
BL Browning
C Huebner
DR Bentley
G Curocichin
GA Watterson
Gonçalo R Abecasis
GT Marth
H Li
Hyun M Kang
J Majewski
J Marchini
J Terr
JA Tennessen
L Mamanova
M Choi
M Nelson
MA DePristo
Margaret G Ehm
Matthew R Nelson
MJ Bamshad
Mohammad I Othman
R Li
R Nielsen
S Purcell
SB Ng
Sebastian Zöllner
SQ Le
SR Browning
Stephanie L Chissoe
The 1000 Genomes Project Consortium
VM Schaibley
X Liu
X Zhan
Y Li
Y Wang
Yancy Lo
Publication venue: 'Springer Science and Business Media LLC'

Crossref