Recently developed genotype imputation methods are a powerful tool for
detecting untyped genetic variants that affect disease susceptibility in
genetic association studies. However, existing imputation methods require
individual-level genotype data, whereas in practice often only summary data
are available. For example, this may occur because, for
reasons of privacy or politics, only summary data are made available to the
research community at large; or because only summary data are collected, as in
DNA pooling experiments. In this article we introduce a new statistical method
that can accurately infer the frequencies of untyped genetic variants in these
settings, and indeed substantially improve frequency estimates at typed
variants in pooling experiments where observations are noisy. Our approach,
which predicts each allele frequency using a linear combination of observed
frequencies, is statistically straightforward, and related to a long history of
the use of linear methods for estimating missing values (e.g., Kriging). The
main statistical novelty is our approach to regularizing the covariance matrix
estimates, and the resulting linear predictors, which is based on methods from
population genetics. We find that, besides being both fast and
flexible---allowing new problems to be tackled that cannot be handled by
existing imputation approaches purpose-built for the genetic context---these
linear methods are also very accurate. Indeed, imputation accuracy using this
approach is similar to that obtained by state-of-the-art imputation methods
that use individual-level data, but at a fraction of the computational cost.

Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org) at http://dx.doi.org/10.1214/10-AOAS338
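
The linear prediction idea described in the abstract can be illustrated with a minimal sketch. It assumes a reference panel of phased haplotypes coded 0/1 and predicts untyped allele frequencies from the conditional mean of a multivariate normal, as in Kriging. The function name `impute_frequencies`, its parameters, and the toy panel are hypothetical. The shrinkage toward the diagonal used here is a generic stand-in and is not the population-genetics-based regularization that the article introduces.

```python
import numpy as np

def impute_frequencies(panel, typed_idx, untyped_idx, observed_freq, shrink=0.1):
    """Predict untyped allele frequencies as a linear combination of observed ones.

    panel         : (n_haplotypes, n_snps) array of 0/1 alleles (hypothetical reference panel)
    typed_idx     : column indices of SNPs with observed summary frequencies
    untyped_idx   : column indices of SNPs to predict
    observed_freq : observed allele frequencies at the typed SNPs
    shrink        : generic shrinkage weight in [0, 1] (an assumption, not the article's scheme)
    """
    panel = np.asarray(panel, dtype=float)
    observed_freq = np.asarray(observed_freq, dtype=float)

    mu = panel.mean(axis=0)              # panel allele frequencies
    sigma = np.cov(panel, rowvar=False)  # sample covariance between SNPs

    # Generic regularization: scale off-diagonal covariances by (1 - shrink).
    sigma = (1.0 - shrink) * sigma + shrink * np.diag(np.diag(sigma))

    # Covariance blocks; a tiny ridge keeps the typed block invertible.
    s_tt = sigma[np.ix_(typed_idx, typed_idx)] + 1e-8 * np.eye(len(typed_idx))
    s_ut = sigma[np.ix_(untyped_idx, typed_idx)]

    # Linear predictor: mu_u + S_ut S_tt^{-1} (observed - mu_t).
    weights = np.linalg.solve(s_tt, s_ut.T).T
    pred = mu[untyped_idx] + weights @ (observed_freq - mu[typed_idx])
    return np.clip(pred, 0.0, 1.0)

# Toy usage: SNP 2 is untyped; SNPs 0 and 1 have observed summary frequencies.
toy_panel = np.array([
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
])
estimate = impute_frequencies(toy_panel, [0, 1], [2], [0.3, 0.5])
print(estimate)
```

The weights depend only on the panel, not on the observed frequencies, so they can be computed once and reused across many summary data sets, which is one reason a linear approach of this kind is computationally cheap.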