Orienting Future Trends in Local Ancestry Deconvolution Models to Optimally Decipher Admixed Individual Genome Variations
Rapid advances in sequencing and genotyping technologies have significantly contributed to shaping the area of medical and population genetics. Several thousand genomes have been completed, with millions of variants identified in human deoxyribonucleic acid (DNA) sequences. These genomic variations strongly influence phenotypic manifestations and physiological functions of different individuals or population groups. Of particular importance are variations introduced by admixture events, which contribute significantly to a remarkable phenotypic variability with medical and/or evolutionary implications. In this case, knowledge of local ancestry estimates and the date of admixture is of utmost importance for a better understanding of genomic variation patterns throughout modern human evolution and adaptive processes. In this chapter, we survey existing models for local ancestry deconvolution and for dating admixture events to identify gaps that still need to be filled and to orient future trends in designing more effective models, which account for current challenges and produce more accurate and biologically relevant estimates.
Modeling Population Structure Under Hierarchical Dirichlet Processes
We propose a Bayesian nonparametric model to infer population admixture, extending the hierarchical Dirichlet process to allow for correlation between loci due to linkage disequilibrium. Given multilocus genotype data from a sample of individuals, the proposed model allows inferring and classifying individuals as unadmixed or admixed, inferring the number of subpopulations ancestral to an admixed population and the population of origin of chromosomal regions. Our model does not assume any specific mutation process, and can be applied to most of the commonly used genetic markers. We present a Markov chain Monte Carlo (MCMC) algorithm to perform posterior inference from the model and we discuss some methods to summarize the MCMC output for the analysis of population admixture. Finally, we demonstrate the performance of the proposed model in a real application, using genetic data from the ectodysplasin-A receptor (EDAR) gene, which is considered to be ancestry-informative due to well-known variations in allele frequency as well as phenotypic effects across ancestry. The structure analysis of this dataset leads to the identification of a rare haplotype in Europeans. We also conduct a simulated experiment and show that our algorithm outperforms parametric methods
Identification of breed contributions in crossbred dogs
There has been strong public interest recently in the interrogation of canine ancestries using direct-to-consumer (DTC) genetic ancestry inference tools. Our goal is to improve the accuracy of the associated computational tools by developing superior algorithms for identifying the breed composition of mixed-breed dogs. Genetic test data, based on SNP markers, were provided by Mars Veterinary. We approach this ancestry inference problem from two main directions. The first approach is optimized for datasets composed of a small number of ancestry informative markers (AIMs). We first compute haplotype frequencies from purebred ancestral panels; these frequencies characterize genetic variation within breeds and are used to predict breed compositions. Because the number of possible breed combinations in admixed dogs is large, we approximately sample this search space with a Metropolis-Hastings algorithm. As the proposal density, we either sample new breeds for the lineage uniformly, or we bias the Markov chain so that breeds in the lineage are more likely to be replaced by similar breeds. The second direction we explore is dominated by HMM approaches, which view genotypes as realizations of latent variable sequences corresponding to breeds. In this approach, an admixed canine sample is viewed as a linear combination of segments from dogs in the ancestral panel. Results were evaluated using two different performance measures. First, we looked at a generalization of binary ROC curves to multi-class classification problems. Second, to judge breed contribution approximations more accurately, we computed the difference between expected and predicted breed contributions. Experimental results on a synthetic, admixed test dataset using AIMs showed that the MCMC approach successfully predicts breed proportions for a variety of lineage complexities. Furthermore, due to exploration in the MCMC algorithm, true breed contributions are underestimated. The HMM approach performed less well, presumably because it uses less of the information in the dataset.
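The Metropolis-Hastings search over breed lineages described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: it uses single-marker allele frequencies rather than haplotype frequencies, a simplified binomial genotype likelihood, an implicit uniform prior over lineages, and only the uniform proposal. The names `log_likelihood` and `mh_breed_proportions` and all parameters are hypothetical.

```python
import math
import random


def log_likelihood(lineage, genotypes, freqs):
    """Log-likelihood of 0/1/2 genotypes given the ancestors' breeds (toy model)."""
    total = 0.0
    for m, g in enumerate(genotypes):
        # Expected allele frequency: average over the breeds in the lineage.
        q = sum(freqs[b][m] for b in lineage) / len(lineage)
        q = min(max(q, 1e-6), 1.0 - 1e-6)  # keep the logarithms finite
        total += (math.log(math.comb(2, g))
                  + g * math.log(q) + (2 - g) * math.log(1.0 - q))
    return total


def mh_breed_proportions(genotypes, freqs, n_ancestors=4,
                         n_iter=5000, burn_in=1000, seed=0):
    """Estimate breed proportions by sampling lineages with Metropolis-Hastings."""
    rng = random.Random(seed)
    n_breeds = len(freqs)
    lineage = [rng.randrange(n_breeds) for _ in range(n_ancestors)]
    current = log_likelihood(lineage, genotypes, freqs)
    counts = [0] * n_breeds
    for step in range(n_iter):
        # Uniform (symmetric) proposal: replace one ancestor's breed at random.
        proposal = list(lineage)
        proposal[rng.randrange(n_ancestors)] = rng.randrange(n_breeds)
        candidate = log_likelihood(proposal, genotypes, freqs)
        # Symmetric proposal, so the acceptance ratio is the likelihood ratio.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            lineage, current = proposal, candidate
        if step >= burn_in:
            for breed in lineage:
                counts[breed] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

The similarity-biased proposal mentioned in the abstract would make the proposal asymmetric, so the acceptance ratio would then also need the Hastings correction term.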
Evolutionary Inference from Admixed Genomes: Implications of Hybridization for Biodiversity Dynamics and Conservation
Hybridization as a macroevolutionary mechanism has been historically underappreciated among vertebrate biologists. Yet, the advent and subsequent proliferation of next-generation sequencing methods has increasingly shown hybridization to be a pervasive agent influencing evolution in many branches of the Tree of Life (to include ancestral hominids). Despite this, the dynamics of hybridization with regards to speciation and extinction remain poorly understood. To this end, I here examine the role of hybridization in the context of historical divergence and contemporary decline of several threatened and endangered North American taxa, with the goal to illuminate implications of hybridization for promoting—or impeding—population persistence in a shifting adaptive landscape.
Chapter I employed population genomic approaches to examine potential effects of habitat modification on species boundary stability in co-occurring endemic fishes of the Colorado River basin (Gila robusta and G. cypha). Results showed how one potential outcome of hybridization might drive species decline: via a breakdown in selection against interspecific heterozygotes and subsequent genetic erosion of parental species.
Chapter II explored long-term contributions of hybridization in an evolutionarily recent species complex (Gila) using a combination of phylogenomic and phylogeographic modelling approaches. Massively parallel computational methods were developed (and so deployed) to categorize sources of phylogenetic discordance as drivers of systematic bias among a panel of species tree inference algorithms. Contrary to past evidence, we found that hypotheses of hybrid origin (excluding one notable example) were instead explained by gene-tree discordance driven by a rapid radiation.
Chapter III examined patterns of local ancestry in the genome of the endangered red wolf (Canis rufus), a controversial taxon at the center of a long-standing debate about the origin of the species. Analyses show how pervasive autosomal introgression served to mask signatures of prior isolation, in turn misleading analyses that led the species to be interpreted as being of recent hybrid origin. Analyses also showed how recombination interacts with selection to create a non-random, structured genomic landscape of ancestries with, in the case of the red wolf, the ‘original’ species tree being retained only in low-recombination ‘refugia’ of the X chromosome.
The final three chapters present bioinformatic software that I developed for my dissertation research to facilitate molecular approaches and analyses presented in Chapters I–III. Chapter IV details an in-silico method for optimizing similar genomic methods as used herein (RADseq of reduced representation libraries) for other non-model organisms. Chapter V describes a method for parsing genomic datasets for elements of interest, either as a filtering mechanism for downstream analysis, or as a precursor to targeted-enrichment reduced-representation genomic sequencing. Chapter VI presents a rapid algorithm for the definition of a ‘most parsimonious’ set of recombinational breakpoints in genomic datasets, as a method promoting local ancestry analyses as utilized in Chapter III.
My three case studies and accompanying software promote three trajectories in modern hybridization research: How does hybridization impact short-term population persistence? How does hybridization drive macroevolutionary trends? and How do outcomes of hybridization vary in the genome? In so doing, my research promotes a deeper understanding of the role that hybridization has and will continue to play in governing the evolutionary fates of lineages at both contemporary and historic timescales
A nonparametric HMM for genetic imputation and coalescent inference
Genetic sequence data are well described by hidden Markov models (HMMs) in
which latent states correspond to clusters of similar mutation patterns. Theory
from statistical genetics suggests that these HMMs are nonhomogeneous (their
transition probabilities vary along the chromosome) and have large support for
self transitions. We develop a new nonparametric model of genetic sequence
data, based on the hierarchical Dirichlet process, which supports these self
transitions and nonhomogeneity. Our model provides a parameterization of the
genetic process that is more parsimonious than other more general nonparametric
models which have previously been applied to population genetics. We provide
truncation-free MCMC inference for our model using a new auxiliary sampling
scheme for Bayesian nonparametric HMMs. In a series of experiments on male X
chromosome data from the Thousand Genomes Project and also on data simulated
from a population bottleneck we show the benefits of our model over the popular
finite model fastPHASE, which can itself be seen as a parametric truncation of
our model. We find that the number of HMM states found by our model is
correlated with the time to the most recent common ancestor in population
bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics
applied to large and complex genetic data
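To make the two properties named in this abstract concrete (transition probabilities that vary along the chromosome, and large mass on self transitions), here is a minimal sketch of a distance-dependent transition matrix of the kind used in finite haplotype-cluster HMMs. It is not the nonparametric model of the paper; the function name `transition_matrix` and the parameters `jump_rate` and `weights` are assumptions made for illustration.

```python
import math


def transition_matrix(distance, jump_rate, weights):
    """Transition matrix between latent clusters for two adjacent markers.

    With probability exp(-jump_rate * distance) the chain stays in its
    current cluster; otherwise it jumps to a cluster drawn from `weights`
    (which may be the same cluster again). `weights` must sum to 1.
    """
    stay = math.exp(-jump_rate * distance)
    k = len(weights)
    return [
        [(stay if i == j else 0.0) + (1.0 - stay) * weights[j] for j in range(k)]
        for i in range(k)
    ]


# Closely spaced markers: almost all mass on self transitions.
near = transition_matrix(distance=0.001, jump_rate=10.0, weights=[0.5, 0.3, 0.2])
# Distant markers: rows approach the cluster weights.
far = transition_matrix(distance=10.0, jump_rate=10.0, weights=[0.5, 0.3, 0.2])
```

Because `distance` differs between marker pairs, the matrix changes along the chromosome, which is what makes the chain nonhomogeneous.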
Using Dirichlet Process Priors For Bayesian Mixture Clustering
We describe a non-parametric Bayesian model using genotype data to classify individuals among populations where the total number of populations is unknown. The model assumes that a population is characterized by a set of allele frequencies that follow multinomial distributions. The Dirichlet Process is applied as the prior distribution. The method estimates the number of populations together with the allele frequencies and the ancestry coefficients of each individual. Distance matrices and bootstrap support numbers based on MCMC runs are generated to create a phylogeny of the ancestral populations
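The role of the Dirichlet process prior, which lets the number of populations stay unknown in advance, can be illustrated with the Chinese restaurant process, a standard sequential view of that prior. The sketch below only draws a partition of individuals from the prior; it does not include the allele-frequency likelihood or the MCMC inference the abstract describes. The function name `sample_crp` and the concentration parameter `alpha` are illustrative assumptions.

```python
import random


def sample_crp(n_individuals, alpha, seed=0):
    """Draw population labels for individuals from a Chinese restaurant process.

    Individual i joins an existing population with probability proportional
    to its current size, or founds a new one with probability proportional
    to `alpha`.
    """
    rng = random.Random(seed)
    assignments = []
    sizes = []  # sizes[k] = number of individuals currently in population k
    for _ in range(n_individuals):
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)  # a new population is created
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments, sizes


labels, population_sizes = sample_crp(n_individuals=50, alpha=1.0)
```

Larger values of `alpha` favor more populations a priori; in a full model the data would reweight these prior probabilities through the genotype likelihood.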
Genetic dating and pattern of admixture in modern human evolution
Genetic variation is shaped by admixture between populations in an evolutionary process. The mixture dynamic between groups of populations results in a mosaic of chromosomal segments inherited from multiple ancestral populations. The distribution of ancestral chromosomal segments and the recombination breakpoints in an admixed genome provide information about the time of admixture. Studying populations with particular ancestries has become a major interest in population genetics because of the medical and evolutionary impacts of the patterns of single nucleotide polymorphisms. It provides a better understanding of the impact of population migrations and helps us uncover interactions between several populations. Most of the research on dating admixture has focused on a single interaction between two populations, using various approaches. Some studies have extended this to the mixing of three populations, based on assumptions and approaches that differ from one tool to another. However, inferring distinct ancestral proportions along the genome of an admixed individual, together with plausible dates of admixture, remains a challenge in the case of multi-way admixed populations. This dissertation consists of three research initiatives. First, we provide a comprehensive review and comparison of current methods for dating admixture events. Second, we assess various admixture dating tools that estimate the time of admixture between two parental populations. We do so by performing simulations that assume a particular number of generations since admixture and using these to evaluate the tools. Third, we apply the top three assessed methods to some admixed populations from the 1000 Genomes Project. Although MALDER shows improvement over other current methods and produces reasonable date estimates, the results from both simulation and real data suggest that dating ancient admixture events while accounting for the effect of other admixtures remains a challenge. Our results suggest the need for a new approach to dating ancient and complex admixture events in multi-way admixed populations.
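The link between ancestral segment lengths and admixture time can be sketched with a back-of-the-envelope estimator. It assumes a single admixture pulse T generations ago, under which tracts of an ancestry contributing proportion m are approximately exponentially distributed with rate (1 - m) * T per Morgan. This is a textbook-style approximation, not the method of MALDER or of any tool evaluated in the dissertation; the function name and arguments are hypothetical.

```python
def estimate_admixture_time(tract_lengths_morgans, ancestry_proportion):
    """Estimate generations since a single admixture pulse from tract lengths.

    Assumes tract lengths (in Morgans) of one ancestry are exponential with
    rate (1 - m) * T, where m is that ancestry's admixture proportion. The
    maximum-likelihood rate of an exponential is 1 / mean, so
    T = 1 / (mean_length * (1 - m)).
    """
    if not tract_lengths_morgans:
        raise ValueError("at least one tract length is required")
    if not 0.0 < ancestry_proportion < 1.0:
        raise ValueError("ancestry_proportion must lie strictly between 0 and 1")
    mean_length = sum(tract_lengths_morgans) / len(tract_lengths_morgans)
    return 1.0 / (mean_length * (1.0 - ancestry_proportion))


# Tracts averaging 0.1 Morgans from an ancestry with proportion 0.5.
generations = estimate_admixture_time([0.08, 0.12, 0.10], ancestry_proportion=0.5)
```

The sketch ignores finite chromosome length, errors in calling ancestry tracts, and multiple or continuous admixture events, which are the complications the abstract identifies as still challenging.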
Mapping genes underlying ethnic differences in tuberculosis risk by linkage disequilibrium in the South African coloured population of the Western Cape
Includes bibliographical references. The South African Coloured (SAC) population of the Western Cape is the result of unions between Europeans, Africans (Bantu and Khoisan), and various other populations (of Malaysian or Indonesian descent). The world-wide burden of tuberculosis remains an enormous problem, and is particularly severe in this population. In general, admixed populations that have arisen in historical times can make an important contribution to the discovery of disease susceptibility genes if the parental populations exhibit substantial variation in susceptibility. Despite numerous successful genome-wide association studies, detecting variants that confer low disease risk still poses a challenge. Furthermore, admixture association studies for multi-way admixed populations pose persistent challenges, including the choice of an accurate ancestral panel for inferring ancestry and for imputing missing genotypes to identify possible genetic variants causing susceptibility to disease. This thesis addresses some of these challenges. We first developed PROXYANC, an approach to select the best proxy ancestral populations for admixed populations. From the simulation of a multi-way admixed population, we demonstrated the ability and accuracy of PROXYANC in selecting the best proxy ancestry and illustrated the importance of the choice of ancestries in both estimating admixture proportions and imputing missing genotypes. We applied this approach to the South African Coloured population, to refine both the choice of ancestral populations and their genetic contributions. We also demonstrated that the ancestral allele frequency differences correlated with increased linkage disequilibrium (LD) in the SAC, and that the increased LD originates from admixture events rather than population bottlenecks. Secondly, we conducted a study to determine whether ancestry-specific genetic contributions affect tuberculosis risk. 
We additionally conducted imputation genome-wide association studies and a meta-analysis incorporating previous genome-wide association studies of tuberculosis