Whole and targeted sequencing of human genomes is a promising, increasingly
feasible tool for discovering genetic contributions to risk of complex
diseases. A key step is calling an individual's genotype from the multiple
aligned short-read sequences of that individual's DNA, each of which is subject to nucleotide
read error. Current methods are designed to call genotypes separately at each
locus from the sequence data of unrelated individuals. Here we propose
likelihood-based methods that improve calling accuracy by exploiting two
features of sequence data. The first is the linkage disequilibrium (LD) between
nearby SNPs. The second is the Mendelian pedigree information available when
related individuals are sequenced. In both cases the likelihood involves the
probabilities of read variant counts given genotypes, summed over the
unobserved genotypes. Parameters governing the prior genotype distribution and
the read error rates can be estimated either from the sequence data itself or
from external reference data. We use simulations and synthetic read data based
on the 1000 Genomes Project to evaluate the performance of the proposed
methods. An R program to apply the methods to small families is freely
available at http://med.stanford.edu/epidemiology/PHGC/.

Comment: Published at http://dx.doi.org/10.1214/11-AOAS527 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
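
The likelihood structure described in the abstract (probabilities of read variant counts given genotypes, summed over the unobserved genotypes) can be illustrated for the simplest case of one individual at one SNP. The sketch below is not the authors' R program and does not model LD or pedigree structure. It assumes a binomial read model with a symmetric per-read error rate `err` and a Hardy-Weinberg genotype prior with variant allele frequency `q`. Both modeling choices and all function names are hypothetical, chosen only for illustration.

```python
from math import comb


def read_likelihood(k, n, g, err):
    """P(k variant reads out of n | genotype g), with g = number of variant alleles.

    Assumption: reads are independent and each read reports the wrong
    allele with probability err (symmetric error model).
    """
    # Probability that a single read shows the variant allele given genotype g.
    p = (g / 2) * (1 - err) + (1 - g / 2) * err
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def hwe_prior(q):
    """Assumed Hardy-Weinberg prior over genotypes 0, 1, 2 for variant frequency q."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q**2]


def genotype_posterior(k, n, q, err):
    """Return (posterior over genotypes 0/1/2, marginal likelihood of the read counts)."""
    prior = hwe_prior(q)
    joint = [prior[g] * read_likelihood(k, n, g, err) for g in (0, 1, 2)]
    # The likelihood of the observed counts sums over the unobserved genotype.
    marginal = sum(joint)
    return [j / marginal for j in joint], marginal


def call_genotype(k, n, q, err):
    """Call the genotype with the highest posterior probability."""
    posterior, _ = genotype_posterior(k, n, q, err)
    return max((0, 1, 2), key=lambda g: posterior[g])
```

For example, `call_genotype(5, 10, q=0.2, err=0.01)` weighs a heterozygous explanation of five variant reads in ten against the two homozygous explanations, which would require many read errors. In this single-site setting, `q` and `err` play the role of the prior and error-rate parameters that the abstract says can be estimated from the sequence data or from external reference data.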