5 research outputs found

    Statistical Methods for Identifying Demographic Structure in DNA Sequence Alignments

    All life on Earth, from viruses and bacteria, trees and flowers, to birds and human beings, can be traced back to a single common ancestor. However, the evolutionary history that led to this diversity of life is a complicated story that we do not yet fully understand. Since the discovery of the structure of deoxyribonucleic acid (DNA) in 1953, and the development of DNA sequencing technology, researchers have been using similarities and differences in the genomes of organisms to better understand the relationships between species. However, due to the complexity of the evolutionary history of life, simplifying assumptions must be made to make mathematical models tractable. It is therefore of paramount importance for researchers to be able to identify when the simplifying assumptions of a specific model are unreasonable. In this thesis we present two projects, and although they are different in implementation, both attempt to investigate simplifying assumptions in the closely related fields of population genetics and phylogenetics. However, we also present applications of our projects where the results of our work are not used in assessing assumptions for further analyses, but are of standalone interest to researchers. Our first project is concerned with the development of a method for constructing coordinate representations for single-copy DNA, such as mitochondrial DNA (mtDNA) or Y-chromosomal DNA, analogous to the use of PCA for nuclear DNA. We construct a coordinate system that, given p informative sites in an alignment of n individuals, returns p-dimensional coordinates for each of the n individuals. We order the dimensions by the proportion of variability each dimension captures in the overall genetic diversity. From these coordinates in "genetic space" researchers may perform a number of downstream analyses. It is possible to optimally visualise high-dimensional sequence data in two or three dimensions.
One may use our method to identify closely related individuals, identify sites in the alignment that are closely linked, or use the same coordinate space to find sites that are closely linked with groups of individuals. Finally, one may choose to test for significant relationships between the structure of the coordinates in genetic space and metadata recorded on sequenced individuals, indicating demographic variables that are highly related to the evolutionary history of an alignment. This final application of our method, where one may test for demographic structure in sequence data, is of key importance to the theme of discovering when simplifying assumptions of analyses are not reasonable. Through the comparison of coordinates in genetic space and any demographic variables of interest, researchers may explore whether or not the individuals in the alignment indicate population substructure. For example, one may investigate if there appears to be a phylogeographic structure to the individuals forming distinct subpopulations, and if migration appears to occur between subpopulations. Using empirical data, we show that our method can readily recover tree-like structure, identify strong genetic groupings based on qualitative traits, and recover phylogeographic signal given provenanced sampling information. We show that our method can even be used to suggest routes of migration based on mtDNA. Finally, we apply our method to modern Aboriginal Australian mtDNA to show strong evidence for discrete geographic populations of Aboriginal Australian peoples that display permanence on the Australian landscape dating back to the original colonisation of Australia 50 thousand years before present (kya). Our second project is concerned with identifying departures from a tree-like evolutionary history at the species level.
It is not uncommon for closely related species (Species A and C, say) to still be capable of interbreeding and producing viable "hybrid" offspring (Species B, say). Under these conditions, a phylogenetic tree cannot describe the evolutionary history of the hybrid species, and instead an admixture graph may be a better description. We begin by considering the evolutionary history of three species: a hybrid organism that has undergone some independent evolution (Species B), and two "parent" organisms, Species A and C. Relatively long, contiguous regions of the genome of Species B will have undergone no recombination since the admixture event. These regions will have been contributed by either Species A (and hence will be more closely related to Species A), or Species C. We aim to estimate the proportion of the genome contributed by Species A, denoted γ, by considering the proportion of informative site patterns that indicate evidence for the two possible ancestries. The mixing proportion γ is the parameter of interest in our analyses. However, due to the classical problem of the non-identifiability of mixing parameters in multinomial distributions, we describe two Bayesian methods for estimating γ. Our first method places prior distributions on the parameters of the model, and uses Approximate Bayesian Computation (ABC) to estimate the marginal posterior distribution of γ. Our second, closely related method instead estimates the marginal posterior distribution of γ via numerical integration. We show via a simulation study that our methods can accurately estimate the true value of γ, and perform well under biologically reasonable scenarios. However, we also find that our methods suffer from a relatively small positive bias for small values of γ, i.e., when one of the parent species contributes very little to the genome of the hybrid species. We compare the performance of our method to the popular method of the ratio of f4 statistics.
We do this by estimating the proportion of Neanderthal ancestry in pre-ice age European human samples and comparing our results to the findings of Fu et al. [18]. We show that our method recovers extremely similar estimates of Neanderthal ancestry, with no apparent systematic bias when compared to the results of Fu et al. Finally, we apply our method to the genomes of Late Pleistocene European bison (Bison bonasus) and Steppe bison (Bison priscus) to understand the evolutionary history of bovid megafauna in Europe over the last seventy thousand years. It was thought that before 10 kya the only bovid present in Europe was the Steppe bison. However, mtDNA from bone samples dating from the present day back to approximately 70 kya indicated that a second bison species was also roaming Europe before 10 kya, more closely related to modern cattle than the Steppe bison. After nuclear DNA was sequenced, we were able to show that this new species of bovid was actually a hybrid offspring of Aurochs (the ancestor of modern cattle) and Steppe bison, an event that occurred approximately 120 kya. We used our method, in concert with the ratio of f4 statistics, to show that the hybrid species contained approximately 10% Aurochs and 90% Steppe bison ancestry. Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 201
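
The ABC step described in this abstract can be illustrated with a minimal rejection sampler. This is only a sketch under loudly stated assumptions, not the thesis's actual model: it assumes a toy setup in which each informative site supports ancestry from Species A with probability γ(1 − err) + (1 − γ)err, where `err` is a hypothetical nuisance rate at which a site pattern points to the wrong parent. The function name, the priors, the tolerance and the example counts are all invented for illustration.

```python
import numpy as np


def abc_mixing_proportion(n_a, n_total, n_draws=200_000, tol=0.005, seed=0):
    """Rejection ABC for a mixing proportion gamma under a toy model.

    Assumed (hypothetical) model: each of n_total informative sites supports
    ancestry A with probability gamma*(1 - err) + (1 - gamma)*err, where err
    is a nuisance probability that a site pattern points to the wrong parent.
    """
    rng = np.random.default_rng(seed)
    gamma = rng.uniform(0.0, 1.0, n_draws)   # prior on the mixing proportion
    err = rng.beta(2.0, 50.0, n_draws)       # assumed prior on the nuisance rate
    p_a = gamma * (1.0 - err) + (1.0 - gamma) * err
    sim = rng.binomial(n_total, p_a)         # simulated count of A-supporting sites
    # Keep draws whose simulated proportion is within tol of the observed one.
    keep = np.abs(sim - n_a) / n_total <= tol
    return gamma[keep]                       # draws from the approximate marginal posterior of gamma


# Made-up observation: 1580 of 5000 informative sites support Species A.
posterior = abc_mixing_proportion(n_a=1580, n_total=5000)
print(len(posterior), posterior.mean())
```

Because γ and `err` enter the likelihood only through one combined probability, the data alone cannot separate them; the prior on the nuisance rate is what makes the marginal posterior of γ informative in this toy version.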

    Cases of trisomy 21 and trisomy 18 among historic and prehistoric individuals discovered from ancient DNA

    Aneuploidies, and in particular trisomies, represent the most common genetic aberrations observed in human genetics today. To explore the presence of trisomies in historic and prehistoric populations we screen nearly 10,000 ancient human individuals for the presence of three copies of any of the target autosomes. We find clear genetic evidence for six cases of trisomy 21 (Down syndrome) and one case of trisomy 18 (Edwards syndrome), and all cases are present in infant or perinatal burials. We perform comparative osteological examinations of the skeletal remains and find overlapping skeletal markers, many of which are consistent with these syndromes. Interestingly, three cases of trisomy 21 and the case of trisomy 18 were detected in two contemporaneous sites in early Iron Age Spain (800-400 BCE), potentially suggesting a higher frequency of burials of trisomy carriers in those societies. Notably, the care with which the burials were conducted, and the items found with these individuals, indicate that ancient societies likely acknowledged these individuals with trisomy 18 and 21 as members of their communities, from the perspective of burial practice.

    Data driven model selection and parameter estimation using semi-automatic approximate Bayesian computation to reconstruct population dynamics from ancient DNA.

    Population genetics is a discipline within the biological sciences that is concerned with the change in frequency of types of individuals in a population due to natural selection, mutation, genetic drift and gene flow. Genetic drift is the part of this process explained by random sampling. Important to the process of genetic drift is population structure, and so we focus on the recovery of population sizes over time, given a set of DNA sequences. With recent advances in computational power and a growth in the amount of data available, increasingly powerful techniques are being developed for the study of sequence data. Key advances in the early 1980s centred around 'the coalescent', a continuous time approximation to the Wright-Fisher model of reproduction, and these advances resulted in Skyline Plot methods for recovering population size estimates over time. Skyline Plots suffer from large variances for the 'coalescent' event times, and sources of error common to DNA sequence sampling schemes. Approximate Bayesian Computation (ABC) is a class of likelihood-free methods for statistical inference. ABC techniques can trace their genesis back to the biological sciences due to the complexity of the models for reproduction (and hence the intractability of likelihood calculations). Unfortunately, like Skyline Plots, ABC also suffers from many sources of error, not least of which occurs when we cannot use sufficient summary statistics. To considerably reduce the effect of the error associated with the use of insufficient summary statistics, we explore a process of semi-automatic summary statistic calculation through the use of 'training data' (simulated under the coalescent model). We obtain a training set of data, and fit a linear model (under a Box-Cox transformation) for each parameter of interest, using common summary statistics for DNA sequences as predictor variables.
We call these linear combinations of (insufficient) summary statistics the semi-automatic summary statistics, and using a new set of simulations, we perform ABC where a simulation is retained if the predicted parameter values are 'close enough' to the predicted parameters for the observed data. We analyse three sets of coalescent simulated data from three population models: the Constant, Exponential and Migration Models, and compare our findings with the corresponding Skyline Plot analyses performed in BEAST. When we simulate data for training our linear model, we must specify a model of population size dynamics, and we explore methods to select a population model, given our data. A common means of model comparison in ABC analyses is the use of Bayes Factors. We show that Bayes Factors perform poorly for our data, and highlight a fundamental bias inherent in any model comparison where the probability of a model, given an observed summary statistic, is employed. As an alternative to Bayes Factors, we apply multiple logistic regression (MLR) to classify our observed data into one of a candidate set of possible models. In conjunction with the MLR analysis, we use principal component analysis for visualisation, and introduce a method for attempting to identify when the correct model is not in the candidate model set, or when a classification seems reasonable. We show that this method of classification performs well for the three observed data sets using sensitivity analysis. Due to the early stage of development of our work, we cannot use real world data, and so we use a different type of simulation, since our method uses coalescent simulations to train the model. We obtain sequence data simulated under a 'forward simulation' framework, a type of sequence simulation that looks forward in time.
We define a two-step process for analysis that begins with MLR classification, and then, under a model chosen by the MLR classification, uses semi-automatic summary statistic calculation for parameter estimation via ABC. We correctly identify this model of population dynamics, and perform parameter estimation on the data, comparing our results with the corresponding BEAST Skyline Plot analysis. Thesis (M.Phil.) -- University of Adelaide, School of Mathematical Sciences, 201
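
The semi-automatic summary statistic idea in this abstract (fit a linear model on training simulations, then run ABC on the fitted predictions) can be sketched as follows. This is a hedged illustration only: an exponential toy model stands in for the coalescent simulator, the Box-Cox transformation is omitted for brevity, and the function names, prior range, summary statistics and retention fraction are all assumptions rather than anything taken from the thesis.

```python
import numpy as np


def simulate(theta, rng, n=50):
    """Toy stand-in for a coalescent simulator: n exponential draws with mean theta."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    return rng.exponential(theta[:, None], size=(theta.size, n))


def summaries(x):
    """Ordinary (insufficient) summary statistics, one row per data set."""
    m = x.mean(axis=1)
    return np.column_stack([m, np.median(x, axis=1), x.std(axis=1), np.log(m)])


def semi_automatic_abc(x_obs, n_train=2000, n_abc=20000, keep_frac=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: training simulations and a linear model theta ~ summaries.
    theta_train = rng.uniform(0.5, 10.0, n_train)          # assumed prior
    s_train = summaries(simulate(theta_train, rng))
    design = np.column_stack([np.ones(n_train), s_train])
    coef, *_ = np.linalg.lstsq(design, theta_train, rcond=None)

    def predict(s):
        # Fitted linear combination of summaries = semi-automatic summary statistic.
        return np.column_stack([np.ones(len(s)), s]) @ coef

    # Step 2: fresh simulations; keep those whose predicted parameter is
    # closest to the prediction for the observed data.
    theta = rng.uniform(0.5, 10.0, n_abc)
    pred_sim = predict(summaries(simulate(theta, rng)))
    pred_obs = predict(summaries(x_obs))[0]
    dist = np.abs(pred_sim - pred_obs)
    return theta[dist <= np.quantile(dist, keep_frac)]


x_obs = simulate(3.0, np.random.default_rng(1))   # pretend data with true theta = 3
posterior = semi_automatic_abc(x_obs)
print(len(posterior), posterior.mean())
```

Here "close enough" is implemented by retaining a fixed fraction of the simulations with the smallest distance, which is one common choice; a fixed distance threshold would work equally well in this sketch.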

    Ancient DNA gives new insights into a Norman Neolithic monumental cemetery dedicated to male elites

    By integrating genomic and archaeological data, we provide new insights into the Neolithic French monumental site of Fleury-sur-Orne in Normandy, where a group of selected individuals was buried in impressively long monuments. The earliest individuals buried at Fleury-sur-Orne match the expected western European Neolithic genetic diversity, while three individuals, designated as genetic outliers, were buried after 4,000 calibrated BCE. We hypothesize that different, unrelated families or clans used the site over several centuries. Thirteen of the 14 analyzed individuals were male, indicating an overarching patrilineal system. However, one exception, a female buried with a symbolically male artifact, suggests that the embodiment of the male gender in death was required to access burial at the monumental structures.

    Ancient genome-wide DNA from France highlights the complexity of interactions between Mesolithic hunter-gatherers and Neolithic farmers

    Starting from 12,000 years ago in the Middle East, the Neolithic lifestyle spread across Europe via separate continental and Mediterranean routes. Genomes from early European farmers have shown a clear Near Eastern/Anatolian genetic affinity with limited contribution from hunter-gatherers. However, no genomic data are available from modern-day France, where both routes converged, as evidenced by a mosaic cultural pattern. Here, we present genome-wide data from 101 individuals from 12 sites covering today’s France and Germany from the Mesolithic (N = 3) to the Neolithic (N = 98) (7000–3000 BCE). Using the genetic substructure observed in European hunter-gatherers, we characterize diverse patterns of admixture in different regions, consistent with both routes of expansion. Early western European farmers show a higher proportion of distinctly western hunter-gatherer ancestry compared to central/southeastern farmers. Our data highlight the complexity of the biological interactions during the Neolithic expansion by revealing major regional variations.