42 research outputs found

    Alignment and analysis of noncoding DNA sequences in Drosophila

    Statistical Population Genomics

    This open access volume presents state-of-the-art inference methods in population genomics, focusing on data analysis based on rigorous statistical techniques. After introducing general concepts related to the biology of genomes and their evolution, the book covers state-of-the-art methods for the analysis of genomes in populations, including demography inference, population structure analysis and detection of selection, using both model-based inference and simulation procedures. Last but not least, it offers an overview of the current knowledge acquired by applying such methods to a large variety of eukaryotic organisms. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, pointers to the relevant literature, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Statistical Population Genomics aims to promote and ensure successful applications of population genomic methods to an increasing number of model systems and biological questions

    Evolutionary Inference from Admixed Genomes: Implications of Hybridization for Biodiversity Dynamics and Conservation

    Hybridization as a macroevolutionary mechanism has been historically underappreciated among vertebrate biologists. Yet, the advent and subsequent proliferation of next-generation sequencing methods has increasingly shown hybridization to be a pervasive agent influencing evolution in many branches of the Tree of Life (to include ancestral hominids). Despite this, the dynamics of hybridization with regards to speciation and extinction remain poorly understood. To this end, I here examine the role of hybridization in the context of historical divergence and contemporary decline of several threatened and endangered North American taxa, with the goal to illuminate implications of hybridization for promoting—or impeding—population persistence in a shifting adaptive landscape. Chapter I employed population genomic approaches to examine potential effects of habitat modification on species boundary stability in co-occurring endemic fishes of the Colorado River basin (Gila robusta and G. cypha). Results showed how one potential outcome of hybridization might drive species decline: via a breakdown in selection against interspecific heterozygotes and subsequent genetic erosion of parental species. Chapter II explored long-term contributions of hybridization in an evolutionarily recent species complex (Gila) using a combination of phylogenomic and phylogeographic modelling approaches. Massively parallel computational methods were developed (and so deployed) to categorize sources of phylogenetic discordance as drivers of systematic bias among a panel of species tree inference algorithms. Contrary to past evidence, we found that hypotheses of hybrid origin (excluding one notable example) were instead explained by gene-tree discordance driven by a rapid radiation. Chapter III examined patterns of local ancestry in the endangered red wolf genome (Canis rufus) – a controversial taxon of a long-standing debate about the origin of the species. 
Analyses show how pervasive autosomal introgression served to mask signatures of prior isolation—in turn misleading analyses that led the species to be interpreted as of recent hybrid origin. Analyses also showed how recombination interacts with selection to create a non-random, structured genomic landscape of ancestries with, in the case of the red wolf, the ‘original’ species tree being retained only in low-recombination ‘refugia’ of the X chromosome. The final three chapters present bioinformatic software that I developed for my dissertation research to facilitate molecular approaches and analyses presented in Chapters I–III. Chapter IV details an in-silico method for optimizing similar genomic methods as used herein (RADseq of reduced representation libraries) for other non-model organisms. Chapter V describes a method for parsing genomic datasets for elements of interest, either as a filtering mechanism for downstream analysis, or as a precursor to targeted-enrichment reduced-representation genomic sequencing. Chapter VI presents a rapid algorithm for the definition of a ‘most parsimonious’ set of recombinational breakpoints in genomic datasets, as a method promoting local ancestry analyses as utilized in Chapter III. My three case studies and accompanying software promote three trajectories in modern hybridization research: How does hybridization impact short-term population persistence? How does hybridization drive macroevolutionary trends? and How do outcomes of hybridization vary in the genome? In so doing, my research promotes a deeper understanding of the role that hybridization has and will continue to play in governing the evolutionary fates of lineages at both contemporary and historic timescales

    Chromosome rearrangements and population genomics

    Chromosome rearrangements result in changes to the physical linkage and order of sequences in the genome. Although we have known about these mutations for more than a century, we still lack a detailed understanding of how they become fixed and what their effect is on other evolutionary processes. Analysing genome sequences provides a way to address this knowledge gap. In this thesis I compare genome assemblies and use population genomic inference to gain a better understanding of the role that chromosome rearrangements play in evolution. I focus on butterflies in the genus Brenthis, where chromosome numbers are known to vary between species. In chapter 2, I present a genome assembly of Brenthis ino and show that its genome has been shaped by many chromosome rearrangements, including a Z-autosome fusion that is still segregating. In chapter 3, I investigate how synteny information in genome sequences can be used to infer ancestral linkage groups and inter-chromosomal rearrangements, implementing the methods in a command-line tool. In chapter 4, I test whether chromosome fissions and fusions have acted as barriers to gene flow between B. ino and its sister species B. daphne. I find that chromosomes involved in rearrangements have experienced less post-divergence gene flow than the rest of the genome, suggesting that rearrangements have promoted speciation. Finally, in chapter 5, I investigate how chromosome rearrangements have become fixed in B. ino, B. daphne, and a third species, B. hecate. I show that genetic drift is unlikely to be a strong enough force to have fixed very underdominant rearrangements, and that there is only weak evidence that chromosome fusions have become fixed through positive natural selection. In summary, this work provides methods for researching chromosome evolution as well as new results about how rearrangements evolve and impact the speciation process

    Phylogenetics in the Genomic Era

    Molecular phylogenetics was born in the middle of the 20th century, when the advent of protein and DNA sequencing offered a novel way to study the evolutionary relationships between living organisms. The first 50 years of the discipline can be seen as a long quest for resolving power. The goal – reconstructing the tree of life – seemed to be unreachable, the methods were heavily debated, and the data limiting. Maybe for these reasons, even the relevance of the whole approach was repeatedly questioned, as part of the so-called molecules versus morphology debate. Controversies often crystalized around long-standing conundrums, such as the origin of land plants, the diversification of placental mammals, or the prokaryote/eukaryote divide. Some of these questions were resolved as gene and species samples increased in size. Over the years, molecular phylogenetics has gradually evolved from a brilliant, revolutionary idea to a mature research field centred on the problem of reliably building trees. This logical progression was abruptly interrupted in the late 2000s. High-throughput sequencing arose and the field suddenly moved into something entirely different. Access to genome-scale data profoundly reshaped the methodological challenges, while opening an amazing range of new application perspectives. Phylogenetics left the realm of systematics to occupy a central place in one of the most exciting research fields of this century – genomics. This is what this book is about: how we do trees, and what we do with trees, in the current phylogenomic era. One obvious, practical consequence of the transition to genome-scale data is that the most widely used tree-building methods, which are based on probabilistic models of sequence evolution, require intensive algorithmic optimization to be applicable to current datasets. 
This problem is considered in Part 1 of the book, which includes a general introduction to Markov models (Chapter 1.1) and a detailed description of how to optimally design and implement Maximum Likelihood (Chapter 1.2) and Bayesian (Chapter 1.4) phylogenetic inference methods. The importance of the computational aspects of modern phylogenomics is such that efficient software development is a major activity of numerous research groups in the field. We acknowledge this and have included seven "How to" chapters presenting recent updates of major phylogenomic tools – RAxML (Chapter 1.3), PhyloBayes (Chapter 1.5), MACSE (Chapter 2.3), Bgee (Chapter 4.3), RevBayes (Chapter 5.2), Beagle (Chapter 5.4), and BPP (Chapter 5.6). Genome-scale data sets are so large that statistical power, which had been the main limiting factor of phylogenetic inference during previous decades, is no longer a major issue. Massive data sets instead tend to amplify the signal they deliver – be it biological or artefactual – so that bias and inconsistency, instead of sampling variance, are the main problems with phylogenetic inference in the genomic era. Part 2 covers the issues of data quality and model adequacy in phylogenomics. Chapter 2.1 provides an overview of current practice and makes recommendations on how to avoid the more common biases. Two chapters review the challenges and limitations of two key steps of phylogenomic analysis pipelines, sequence alignment (Chapter 2.2) and orthology prediction (Chapter 2.4), which largely determine the reliability of downstream inferences. The performance of tree building methods is also the subject of Chapter 2.5, in which a new approach is introduced to assess the quality of gene trees based on their ability to correctly predict ancestral gene order. Analyses of multiple genes typically recover multiple, distinct trees. 
Maybe the biggest conceptual advance induced by the phylogenetic to phylogenomic transition is the suggestion that one should not simply aim to reconstruct “the” species tree, but rather to be prepared to make sense of forests of gene trees. Chapter 3.1 reviews the numerous reasons why gene trees can differ from each other and from the species tree, and what the implications are for phylogenetic inference. Chapter 3.2 focuses on gene trees/species trees reconciliation methods that account for gene duplication/loss and horizontal gene transfer among lineages. Incomplete lineage sorting is another major source of phylogenetic incongruence among loci, which recently gained attention and is covered by Chapter 3.3. Chapter 3.4 concludes this part by taking a user’s perspective and examining the pros and cons of concatenation versus separate analysis of gene sequence alignments. Modern genomics is comparative and phylogenetic methods are key to a wide range of questions and analyses relevant to the study of molecular evolution. This is covered by Part 4. We argue that genome annotation, either structural or functional, can only be properly achieved in a phylogenetic context. Chapters 4.1 and 4.2 review the power of these approaches and their connections with the study of gene function. Molecular substitution rates play a key role in our understanding of the prevalence of nearly neutral versus adaptive molecular evolution, and the influence of species traits on genome dynamics (Chapter 4.4). The analysis of substitution rates, and particularly the detection of positive selection, requires sophisticated methods and models of coding sequence evolution (Chapter 4.5). Phylogenomics also offers a unique opportunity to explore evolutionary convergence at a molecular level, thus addressing the long-standing question of predictability versus contingency in evolution (Chapter 4.6). 
The development of phylogenomics, as reviewed in Parts 1 through 4, has resulted in a powerful conceptual and methodological corpus, which is often reused for addressing problems of interest to biologists from other fields. Part 5 illustrates this application potential via three selected examples. Chapter 5.1 addresses the link between phylogenomics and palaeontology; i.e., how to optimally combine molecular and fossil data for estimating divergence times. Chapter 5.3 emphasizes the importance of the phylogenomic approach in virology and its potential to trace the origin and spread of infectious diseases in space and time. Finally, Chapter 5.5 recalls why phylogenomic methods and the multi-species coalescent model are key in addressing the problem of species delimitation – one of the major goals of taxonomy. It is hard to predict where phylogenomics as a discipline will stand in even 10 years. Maybe a novel technological revolution will bring it to yet another level? We strongly believe, however, that tree thinking will remain pivotal in the treatment and interpretation of the deluge of genomic data to come. Perhaps a prefiguration of the future of our field is provided by the daily monitoring of the current Covid-19 outbreak via the phylogenetic analysis of coronavirus genomic data in quasi real time – a topic of major societal importance, contemporary to the publication of this book, in which phylogenomics is instrumental in helping to fight disease

    Predicting Functional Alterations Caused By Non-synonymous Variants in CHO Using Models Based on Phylogenetic Tree and Evolutionary Preservation

    The Chinese Hamster Ovary (CHO) cell is a major manufacturing platform for one of the most valuable classes of biopharmaceutical products: monoclonal antibodies. Being an immortal cell line adapted to different environments, CHO has accumulated a large number of mutations in its genome. Continuous effort has been invested in building a computational model to predict CHO cell productivity. However, little attention has been paid to its proteins, which are surely affected to some extent by the accumulated mutations. In this project, we focused on the functional effects of non-synonymous variants found in the CHO genome. A tool was built to first identify these variants and then predict their potential functional effect using preservation, a concept derived from evolutionary conservation. First, the PANTHER subfamilies, which are defined on the basis of potential function change within gene trees, were extended by adding proteins from species not covered by PANTHER. Sequences within the same subfamily were then aligned, and Hidden Markov Models (HMMs) were built on these alignments. The HMMs were used to identify homologs among CHO proteins. Preservation was then calculated at every site of the alignments and used to predict the functional alterations caused by mutations at each site. Our tool was validated using data from the original PANTHER subfamilies, PANTHER-PSEP, which also calculates site preservation, and BLAST, a well-accepted homolog search algorithm. CHO protein sequences were then imported and analysed by our tool. For comparison, protein sequences from the Chinese hamster were also analysed, along with two published CHO cell lines: CHO-K1 and CHO-K1GS. The predictions for proteins from these three genomes were then compared by mapping them onto Gene Ontology (GO). Some detailed case studies are also presented. Our tool showed good performance in validation; however, it failed to produce useful hypotheses that would motivate further experiments at the bench. 
The potential causes are discussed at the end

    Vertebrate phylogenomics and gene family evolution

    This thesis is about two topics: the evolution of gene families by the birth-death process of gene duplication and gene loss, and phylogenetic inference. It is a central theme that these two processes are intimately associated - the phylogenies of gene families (of any gene) are shaped by the processes of gene duplication and gene loss, as much as by the processes of speciation and extinction occurring among the species the gene is evolving in. This has two results. Firstly, that we need to know, or assume, something about the processes of gene duplication and loss to correctly understand the pattern of speciation, or cladogenesis, in a group of organisms. Secondly, that we need to know, or assume, something about this pattern if we are to fully appreciate the effect of gene duplication and loss on a gene family phylogeny. The main part of this thesis investigates the use of reconciled tree methods in unravelling species phylogeny and the evolution of gene families. Part of this investigation involves placing reconciled tree methods (and the use of these methods to infer species phylogeny, known as gene tree parsimony), in the context of some related methods: supertree methods and "simultaneous analysis" of combined data. Two empirical studies complete this part of the thesis - one attempting to infer the higher-level phylogeny of vertebrates using gene tree parsimony, and another focusing on a lower taxonomic level, on primate phylogeny. This chapter attempts an integrated study of gene duplication and species phylogeny, which uses information about gene duplication to help date evolutionary events. Despite the close relationship between gene duplication and speciation on phylogenies, it is possible to study gene duplication independently. 
If we restrict ourselves to genes sampled from a single genome, gene family trees represent gene duplications and gene losses occurring during the history of a single species, so the complication of speciation and extinction is eliminated. By realising that the processes of gene duplication and loss in these trees are analogous to the processes of speciation and extinction in species phylogenies, we can harness a toolkit of methods developed for more traditional phylogenies to study these molecular processes. Two such methods are models of cladistic tree shape and birth-death models, which allow the first estimates of the rate of gene loss

    Recent identity by descent in human genetic data - methods and applications

    The thesis describes algorithms for detecting regions of recent identity by descent (IBD) from human genetic data and their applications in optimising resequencing studies, genomic predictions and detecting Mendelian subtypes of diseases. Firstly, we describe the algorithm ANCHAP, which scans pairs of multi-point SNP genotypes for sharing IBD of long haplotypes. A comparison with other methods shows that ANCHAP outperforms them in terms of speed or accuracy. We demonstrate the algorithm on data from population isolates - from Orcades, Croatian islands, and from a population of unrelated individuals. We compare the abundance of IBD segments between cohorts, and identify genetic regions where IBD is most common. Secondly, we verify the IBD regions detected from array data against exome sequence data. We estimate that where sharing IBD between a pair of individuals is inferred, this is confirmed by exome data in 98% of cases. Correctness of IBD detection varies with settings of ANCHAP, length of IBD segments, and position with respect to segment endpoints. We find that with sample sizes of 1000 individuals from an isolated population genotyped using a dense SNP array, and with 20% of these individuals sequenced, 65% of sequences of the un-sequenced subjects can be partially inferred. Implementation of such resequencing strategies requires an IBD-based imputation algorithm, which is outlined. Thirdly, we use recent IBD to detect carriers of Mendelian subtypes of colon cancer. We show this with the example of Lynch syndrome, which accounts for about 3% of colon cancer patients. We detect IBD sharing between known and unknown carriers around DNA mismatch-repair genes. Using the IBD relationship, we build and evaluate a model that predicts presence of Lynch Syndrome mutations. Finally, we discuss whether regions of identity by descent can be used for genomic predictions. 
We conclude that the utility of the inferred IBD regions depends on accuracy of detection, time to most recent common ancestors and mutation rates since

    Simulation and graph mining tools for improving gene mapping efficiency

    Gene mapping is a systematic search for genes that affect observable characteristics of an organism. In this thesis we offer computational tools to improve the efficiency of (disease) gene-mapping efforts. In the first part of the thesis we propose an efficient simulation procedure for generating realistic genetical data from isolated populations. Simulated data is useful for evaluating hypothesised gene-mapping study designs and computational analysis tools. As an example of such evaluation, we demonstrate how a population-based study design can be a powerful alternative to traditional family-based designs in association-based gene-mapping projects. In the second part of the thesis we consider a prioritisation of a (typically large) set of putative disease-associated genes acquired from an initial gene-mapping analysis. Prioritisation is necessary to be able to focus on the most promising candidates. We show how to harness the current biomedical knowledge for the prioritisation task by integrating various publicly available biological databases into a weighted biological graph. We then demonstrate how to find and evaluate connections between entities, such as genes and diseases, from this unified schema by graph mining techniques. Finally, in the last part of the thesis, we define the concept of reliable subgraph and the corresponding subgraph extraction problem. Reliable subgraphs concisely describe strong and independent connections between two given vertices in a random graph, and hence they are especially useful for visualising such connections. We propose novel algorithms for extracting reliable subgraphs from large random graphs. The efficiency and scalability of the proposed graph mining methods are backed by extensive experiments on real data. While our application focus is in genetics, the concepts and algorithms can be applied to other domains as well. 
We demonstrate this generality by considering coauthor graphs in addition to biological graphs in the experiments. Gene mapping is the systematic search of the genome for genes that affect an organism's observable traits. The thesis presents new methods for making the mapping of disease-predisposing genes more efficient. The first part of the thesis examines genome simulation in (typically geographically) isolated populations and presents new simulator software suited to the purpose. Simulated data sets are useful in study design, where they can be used to assess the statistical properties of planned data sets and the behaviour of the analysis methods to be used. As an example of such a study, the thesis walks through a fairly extensive simulation study carried out with the presented software. Based on the results, a population-based case-control study design appears to be a statistically powerful alternative to more expensive family- and pedigree-based designs. The second part of the thesis deals with scoring so-called candidate genes, which may predispose to disease, according to how strongly they are connected to the disease under study. Scoring is important because preliminary analyses of the data typically yield a large number of candidate genes, and going through all of them would be too laborious. Scoring allows follow-up studies to be targeted at the most promising candidates. The thesis shows how biological knowledge currently held in separate databases can be represented in a unified graph form. It also shows how connections between candidate genes and the disease under study can be found in such data and scored using graph mining algorithms. Finally, the thesis presents the reliable subgraph extraction problem and algorithms for solving it. 
The goal in this problem is to extract from a large graph a relatively small subgraph that contains strong and mutually independent connections between two given vertices of the graph. Reliable subgraphs are therefore particularly well suited to visualising the connections found. Besides genetics, reliable subgraphs can also be applied in other fields, such as social network analysis

    Haplotype estimation in polyploids using DNA sequence data

    Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, including in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.