Statistical inference from molecular population genetic data is currently a very active
area of research for two main reasons. First, in the past two decades an enormous
amount of molecular genetic data has been produced, and this amount is
expected to grow further. Second, drawing inferences about complex
population genetics problems, for example understanding the demographic and genetic
factors that shaped modern populations, poses a serious statistical challenge.
Amongst the many different kinds of genetic data that have appeared in the past
two decades, the highly polymorphic microsatellites have played an important role.
Microsatellites revolutionized the population genetics of natural populations, and were
the initial tool for linkage mapping in humans and other model organisms. Despite
their important role, and extensive use, the evolutionary dynamics of microsatellites
are still not fully understood, and the statistical methods applied to them are often
underdeveloped and do not adequately model microsatellite evolution. In this thesis, I address some
aspects of this problem by assessing the performance of existing statistical tools, and
developing some new ones. My work encompasses a range of statistical methods from
simple hypothesis testing to more recent, complex computational statistical tools. This
thesis consists of four main topics.
First, I review the statistical methods that have been developed for microsatellites
in population genetics applications. I review the different models of the microsatellite
mutation process, ask which models are best supported by data, and examine how these
models have been incorporated into statistical methods. I also present estimates of mutation
parameters for several species based on published data.
Second, I evaluate the performance of estimators of genetic relatedness using real
data from five vertebrate populations. I demonstrate that the overall performance
of marker-based pairwise relatedness estimators depends mainly on the relatedness
composition of the population; better marker data can improve performance only
within the limits set by that composition.
Third, I investigate the different null hypotheses that may be used to test for
independence between loci. Using simulations, I show that the outcome of a test of
statistical independence (i.e. zero linkage disequilibrium, LD) is difficult to interpret
in most cases; instead, the null hypothesis tested should account for the
“background LD” due to finite population size. I investigate the utility of a novel
approximate testing procedure to circumvent this problem, and illustrate its use on a
real data set from red deer.
Fourth, I explore the utility of Approximate Bayesian Computation (ABC), a
simulation-based inference method that relies on summary statistics, to estimate
demographic parameters of admixed populations. Assuming a simple demographic model, I show that the choice of
summary statistics greatly influences the quality of the estimation, and that different
parameters are better estimated with different summary statistics. Most importantly, I
show how the estimation of most admixture parameters can be considerably improved
via the use of linkage disequilibrium statistics from microsatellite data.
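
To illustrate the general idea of inference based on summary statistics, the following is a minimal sketch of rejection-based Approximate Bayesian Computation on a toy problem. It is not the demographic model or the estimation procedure used in this thesis. The simulator, the single summary statistic (a sample allele frequency), the parental frequencies `p1` and `p2`, the uniform prior on the admixture proportion `alpha`, and the tolerance `eps` are all assumptions chosen for illustration.

```python
import random


def simulate_summary(alpha, p1, p2, n_copies, rng):
    """Toy simulator (assumed): sample allele frequency in an admixed population.

    The admixed frequency is a mixture of two hypothetical parental
    frequencies p1 and p2, weighted by the admixture proportion alpha.
    """
    p_admixed = alpha * p1 + (1.0 - alpha) * p2
    count = sum(1 for _ in range(n_copies) if rng.random() < p_admixed)
    return count / n_copies


def abc_rejection(s_obs, p1, p2, n_copies, n_draws, eps, seed=0):
    """Rejection ABC: keep prior draws whose simulated summary is close to s_obs."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        alpha = rng.random()  # prior on alpha: Uniform(0, 1), an assumption
        s_sim = simulate_summary(alpha, p1, p2, n_copies, rng)
        if abs(s_sim - s_obs) <= eps:  # distance between summary statistics
            accepted.append(alpha)
    return accepted


# Hypothetical observed summary statistic and parental allele frequencies.
posterior = abc_rejection(s_obs=0.62, p1=0.9, p2=0.2,
                          n_copies=200, n_draws=5000, eps=0.02)
posterior_mean = sum(posterior) / len(posterior)
print(f"accepted {len(posterior)} draws, posterior mean of alpha = {posterior_mean:.3f}")
```

The accepted values of `alpha` approximate its posterior distribution given the observed summary statistic. In this sketch the quality of the approximation depends on how informative the chosen statistic is about the parameter, which is the kind of dependence on summary statistics that the fourth topic examines.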