151 research outputs found

    Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data

    Motivation: Two known types of meiotic recombination are crossovers and gene conversions. Although they leave behind different footprints in the genome, it is a challenging task to tease apart their relative contributions to the observed genetic variation. In particular, for a given population SNP dataset, the joint estimation of the crossover rate, the gene conversion rate and the mean conversion tract length is widely viewed as a very difficult problem.

    A human genome-wide library of local phylogeny predictions for whole-genome inference problems

    Background: Many common inference problems in computational genetics depend on inferring aspects of the evolutionary history of a data set given a set of observed modern sequences. Detailed predictions of the full phylogenies are therefore of value in improving our ability to make further inferences about population history and sources of genetic variation. Making phylogenetic predictions on the scale needed for whole-genome analysis is, however, extremely computationally demanding. Results: In order to facilitate phylogeny-based predictions on a genomic scale, we develop a library of maximum parsimony phylogenies within local regions spanning all autosomal human chromosomes based on Haplotype Map variation data. We demonstrate the utility of this library for population genetic inferences by examining a tree statistic we call 'imperfection,' which measures the reuse of variant sites within a phylogeny. This statistic is significantly predictive of recombination rate, shows additional regional and population-specific conservation, and allows us to identify outlier genes likely to have experienced unusual amounts of variation in recent human history. Conclusion: Recent theoretical advances in algorithms for phylogenetic tree reconstruction have made it possible to perform large-scale inferences of local maximum parsimony phylogenies from single nucleotide polymorphism (SNP) data. As results from the imperfection statistic demonstrate, phylogeny predictions encode substantial information useful for detecting genomic features and population history. This data set should serve as a platform for many kinds of inferences one may wish to make about human population history and genetic variation.

    When two trees go to war

    Rooted phylogenetic networks are often constructed by combining trees, clusters, triplets or characters into a single network that in some well-defined sense simultaneously represents them all. We review these four models and investigate how they are related. In general, the model chosen influences the minimum number of reticulation events required. However, when one obtains the input data from two binary trees, we show that the minimum number of reticulations is independent of the model. The numbers of reticulations necessary to represent the trees, triplets, clusters (in the softwired sense) and characters (with unrestricted multiple crossover recombination) are all equal. Furthermore, we show that these results also hold when the level of the constructed network, rather than the number of reticulations, is minimised. We use these unification results to settle several complexity questions that have been open in the field for some time. We also give explicit examples to show that, already for data obtained from three binary trees, the models begin to diverge.

    A Comparison of Phylogenetic Network Methods Using Computer Simulation

    Background: We present a series of simulation studies that explore the relative performance of several phylogenetic network approaches (statistical parsimony, split decomposition, union of maximum parsimony trees, neighbor-net, simulated history recombination upper bound, median-joining, reduced median joining and minimum spanning network) compared to standard tree approaches (neighbor-joining and maximum parsimony) in the presence and absence of recombination. Principal Findings: In the absence of recombination, all methods recovered the correct topology and branch lengths nearly all of the time when the substitution rate was low, except for minimum spanning networks, which did considerably worse. At a higher substitution rate, maximum parsimony and union of maximum parsimony trees were the most accurate. With recombination, the ability to infer the correct topology was halved for all methods and no method could accurately estimate branch lengths. Conclusions: Our results highlight the need for more accurate phylogenetic network methods and the importance of detecting and accounting for recombination in phylogenetic studies. Furthermore, we provide useful information for choosing a network algorithm and a framework in which to evaluate improvements to existing methods and nove

    Computational Methods for Detecting Large-Scale Chromosome Rearrangements in SNP Data

    Large-scale chromosome rearrangements such as copy number variants (CNVs) and inversions account for a considerable proportion of the genetic variation between human individuals. In a number of cases, they have been closely linked with various inheritable diseases. Single-nucleotide polymorphisms (SNPs) make up another large part of the genetic variation between individuals. They are also typically abundant, and measuring them is straightforward and cheap. This thesis presents computational means of using SNPs to detect the presence of inversions and deletions, a particular variety of CNVs. Technically, the inversion-detection algorithm detects the suppressed recombination rate between inverted and non-inverted haplotype populations, whereas the deletion-detection algorithm uses the EM algorithm to estimate the haplotype frequencies of a window with and without a deletion haplotype. As a contribution to population biology, a coalescent simulator for simulating inversion polymorphisms has been developed. Coalescent simulation is a backward-in-time method of modelling population ancestry. Technically, the simulator also models multiple crossovers by using the Counting model as the chiasma interference model. Finally, this thesis includes an experimental section. The aforementioned methods were tested on synthetic data to evaluate their power and specificity. They were also applied to the HapMap Phase II and Phase III data sets, yielding a number of candidates for previously unknown inversions and deletions and also correctly detecting known rearrangements of these kinds.
Human genomes vary between individuals, and this variation can be of several different types. For example, point mutations affecting single base pairs can often be measured easily and cheaply. However, the genome can also vary on a larger scale. In some cases part of the genome may be reversed or missing entirely; the former type of variation is called an inversion and the latter a deletion. Identifying inversions and deletions in the genome is not as easy as measuring the SNPs (single nucleotide polymorphisms) that result from point mutations. This dissertation develops methods that aim to identify traces of inversions and deletions in SNP data. The aim of the methods is to indicate which regions of the genome are worth examining with other, more accurate but more expensive, techniques in order to identify such large genomic changes. The dissertation also presents a computer program that generates synthetic SNP data sets containing an inversion. This program is used to examine how well different inversion-detection methods perform in various experimental settings. According to the experiments, certain kinds of inversions are detected well by the developed method. The developed methods were applied to the analysis of the HapMap data collected from several human populations. As a result, the methods recovered previously known inversions and deletions as well as new candidate regions for experimental validation.
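The abstract above mentions using the EM algorithm to estimate haplotype frequencies from unphased SNP data. As a rough illustration of that general idea only, the following minimal sketch runs EM on the simplest case of two biallelic SNPs. It does not include the deletion haplotype or the windowing described in the thesis, and the function name `em_haplotype_freqs` and the genotype encoding (count of allele 1 at each SNP) are assumptions made for this example.

```python
def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2) pairs, each value 0, 1 or 2
               (number of copies of allele 1 at SNP 1 and SNP 2).
    Returns a dict mapping haplotypes (a1, a2) to estimated frequencies.
    """
    if not genotypes:
        raise ValueError("need at least one genotype")
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freqs = {h: 0.25 for h in haps}  # uniform starting point

    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step for the only ambiguous case (double heterozygote):
                # the pair is either 00/11 or 01/10.
                p_cis = freqs[(0, 0)] * freqs[(1, 1)]
                p_trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = p_cis + p_trans
                w = p_cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1.0 - w
                counts[(1, 0)] += 1.0 - w
            else:
                # At most one SNP is heterozygous, so the phase is known.
                alleles1 = (0, 1) if g1 == 1 else (g1 // 2, g1 // 2)
                alleles2 = (0, 1) if g2 == 1 else (g2 // 2, g2 // 2)
                for hap in zip(alleles1, alleles2):
                    counts[hap] += 1.0
        # M-step: each individual contributes two haplotypes.
        n_haps = 2.0 * len(genotypes)
        freqs = {h: c / n_haps for h, c in counts.items()}
    return freqs


if __name__ == "__main__":
    example = [(0, 0), (2, 2), (1, 1)]
    for hap, f in sorted(em_haplotype_freqs(example).items()):
        print(hap, round(f, 4))
```

In this toy input the homozygous individuals carry only the 00 and 11 haplotypes, so the double heterozygote is resolved almost entirely as 00/11 after a few iterations.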

    Recombination between heterologous human acrocentric chromosomes

    The short arms of the human acrocentric chromosomes 13, 14, 15, 21 and 22 (SAACs) share large homologous regions, including ribosomal DNA repeats and extended segmental duplications [1,2]. Although the resolution of these regions in the first complete assembly of a human genome, the Telomere-to-Telomere Consortium’s CHM13 assembly (T2T-CHM13), provided a model of their homology [3], it remained unclear whether these patterns were ancestral or maintained by ongoing recombination exchange. Here we show that acrocentric chromosomes contain pseudo-homologous regions (PHRs) indicative of recombination between non-homologous sequences. Utilizing an all-to-all comparison of the human pangenome from the Human Pangenome Reference Consortium [4] (HPRC), we find that contigs from all of the SAACs form a community. A variation graph [5] constructed from centromere-spanning acrocentric contigs indicates the presence of regions in which most contigs appear nearly identical between heterologous acrocentric chromosomes in T2T-CHM13. Except on chromosome 15, we observe faster decay of linkage disequilibrium in the pseudo-homologous regions than in the corresponding short and long arms, indicating higher rates of recombination [6,7]. The pseudo-homologous regions include sequences that have previously been shown to lie at the breakpoint of Robertsonian translocations [8], and their arrangement is compatible with crossover in inverted duplications on chromosomes 13, 14 and 21. The ubiquity of signals of recombination between heterologous acrocentric chromosomes seen in the HPRC draft pangenome suggests that these shared sequences form the basis for recurrent Robertsonian translocations, providing sequence and population-based confirmation of hypotheses first developed from cytogenetic studies 50 years ago [9].
Our work depends on the HPRC draft human pangenome resource established in the accompanying Article [4], and we thank the production and assembly groups for their efforts in establishing this resource. This work used the computational resources of the UTHSC Octopus cluster and NIH HPC Biowulf cluster. We acknowledge support in maintaining these systems that was critical to our analyses. The authors thank M. Miller for the development of a graphical synopsis of our study (Fig. 5); and R. Williams and N. Soranzo for support and guidance in the design and discussion of our work. This work was supported, in part, by National Institutes of Health/NIDA U01DA047638 (E.G.), National Institutes of Health/NIGMS R01GM123489 (E.G.), NSF PPoSS Award no. 2118709 (E.G. and C.F.), the Tennessee Governor’s Chairs programme (C.F. and E.G.), National Institutes of Health/NCI R01CA266339 (T.P., L.G.d.L. and J.L.G.), and the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (A.R., S.K. and A.M.P.). We acknowledge support from Human Technopole (A.G.), Consiglio Nazionale delle Ricerche, Italy (S.B. and V.C.), and Stowers Institute for Medical Research (T.P., L.G.d.L., B.R. and J.L.G.).
Peer Reviewed
Article signed by 13 authors: Andrea Guarracino, Silvia Buonaiuto, Leonardo Gomes de Lima, Tamara Potapova, Arang Rhie, Sergey Koren, Boris Rubinstein, Christian Fischer, Human Pangenome Reference Consortium, Jennifer L. Gerton, Adam M. Phillippy, Vincenza Colonna & Erik Garrison.
Human Pangenome Reference Consortium: "Haley J. Abel, Lucinda L. Antonacci-Fulton, Mobin Asri, Gunjan Baid, Carl A.
Baker, Anastasiya Belyaeva, Konstantinos Billis, Guillaume Bourque, Silvia Buonaiuto, Andrew Carroll, Mark J. P. Chaisson, Pi-Chuan Chang, Xian H. Chang, Haoyu Cheng, Justin Chu, Sarah Cody, Vincenza Colonna, Daniel E. Cook, Robert M. Cook-Deegan, Omar E. Cornejo, Mark Diekhans, Daniel Doerr, Peter Ebert, Jana Ebler, Evan E. Eichler, Jordan M. Eizenga, Susan Fairley, Olivier Fedrigo, Adam L. Felsenfeld, Xiaowen Feng, Christian Fischer, Paul Flicek, Giulio Formenti, Adam Frankish, Robert S. Fulton, Yan Gao, Shilpa Garg, Erik Garrison, Nanibaa’ A. Garrison, Carlos Garcia Giron, Richard E. Green, Cristian Groza, Andrea Guarracino, Leanne Haggerty, Ira Hall, William T. Harvey, Marina Haukness, David Haussler, Simon Heumos, Glenn Hickey, Kendra Hoekzema, Thibaut Hourlier, Kerstin Howe, Miten Jain, Erich D. Jarvis, Hanlee P. Ji, Eimear E. Kenny, Barbara A. Koenig, Alexey Kolesnikov, Jan O. Korbel, Jennifer Kordosky, Sergey Koren, HoJoon Lee, Alexandra P. Lewis, Heng Li, Wen-Wei Liao, Shuangjia Lu, Tsung-Yu Lu, Julian K. Lucas, Hugo Magalhães, Santiago Marco-Sola, Pierre Marijon, Charles Markello, Tobias Marschall, Fergal J. Martin, Ann McCartney, Jennifer McDaniel, Karen H. Miga, Matthew W. Mitchell, Jean Monlong, Jacquelyn Mountcastle, Katherine M. Munson, Moses Njagi Mwaniki, Maria Nattestad, Adam M. Novak, Sergey Nurk, Hugh E. Olsen, Nathan D. Olson, Benedict Paten, Trevor Pesout, Adam M. Phillippy, Alice B. Popejoy, David Porubsky, Pjotr Prins, Daniela Puiu, Mikko Rautiainen, Allison A. Regier, Arang Rhie, Samuel Sacco, Ashley D. Sanders, Valerie A. Schneider, Baergen I. Schultz, Kishwar Shafin, Jonas A. Sibbesen, Jouni Sirén, Michael W. Smith, Heidi J. Sofia, Ahmad N. Abou Tayoun, Françoise Thibaud-Nissen, Chad Tomlinson, Francesca Floriana Tricomi, Flavia Villani, Mitchell R. Vollger, Justin Wagner, Brian Walenz, Ting Wang, Jonathan M. D. Wood, Aleksey V. Zimin & Justin M. Zook"
Postprint (published version)

    Computational insights into the generation of chromosomal copy number changes

    Deviations from a diploid configuration of the human genome, spanning single genes or entire chromosomes, can have wide-ranging impacts on the variation of human phenotypes, including Mendelian and complex forms of diseases. These chromosomal alterations — such as duplications, deletions or copy-neutral loss-of-heterozygosity — are thus important forms of genetic variation for phenotyping populations of individuals as well as populations of cells. Indeed, copy number variants (CNVs) serve as hallmarks of critical changes in the development of particular diseases such as cancer and thus may be used as biomarkers. These CNVs may be either inherited (transmitted by germ cells, originating in meiosis; “germline”) or acquired (originating in mitosis; “somatic mosaicism”). The complex structure and the diverse mechanisms generating CNVs have been studied molecularly, but this has generally not been attempted using population data. This dissertation seeks to provide insights into CNV diversity in two complementary settings: 1) the genesis of germline copy number duplications, and 2) the diversity of acquired CNVs within distinct tumor tissues. First, we develop a novel method to disentangle the haplotype (the specific alleles on an inherited chromosome) composition of de novo germline duplications to characterize the “grandparental origin” of the extra piece of a chromosome. Using large family-based genome-wide association study data, we report the ratio of “bi-allelic” duplications, from inter-chromatid non-allelic homologous recombination (NAHR), to “tri-allelic” duplications, from inter-chromosomal NAHR, as 1.07:1. In addition, our method reveals a third configuration, consisting of both tri-allelic and bi-allelic duplications, which we hypothesize arose from spontaneous inter-chromosomal and inter-chromatid NAHR. The rate of these complex duplications among all the de novo duplications is 6%. 
Second, we assess tumor heterogeneity of biphasic uterine carcinosarcoma (UCS) from 10 patients by analyzing the data of component-specific tumor samples (carcinomatous, sarcomatous, and normal uterine tissues), generated from multiple platforms (SNP array, DNA target sequencing, and whole transcriptome sequencing). We augment the quantification of tumor heterogeneity by considering the haplotype information within the somatic copy number alterations for each sample to more precisely annotate recurrent copy number changes. Our results imply that the carcinomatous and the sarcomatous components in UCS originate from the same clone and the heterogeneity reflects relatively advanced stages. Our work confirms that profiling of carcinomas and sarcomas separately may offer clinical utility. Overall, this dissertation shows the potential utility of incorporating haplotype information in particular settings in population science and cancer biology

    Analysis and Visualization of Local Phylogenetic Structure within Species

    While it is interesting to examine the evolutionary history and phylogenetic relationship between species, for example, in a sort of tree of life, there is also a great deal to be learned from examining population structure and relationships within species. A careful description of phylogenetic relationships within species provides insights into causes of phenotypic variation, including disease susceptibility. The better we are able to understand the patterns of genotypic variation within species, the better these populations may be used as models to identify causative variants and possible therapies, for example through targeted genome-wide association studies (GWAS). My thesis describes a model of local phylogenetic structure, how it can be effectively derived under various circumstances, and useful applications and visualizations of this model to aid genetic studies. I introduce a method for discovering phylogenetic structure among individuals of a population by partitioning the genome into a minimal set of intervals within which there is no evidence of recombination. I describe two extensions of this basic method. The first allows it to be applied to heterozygous, in addition to homozygous, genotypes and the second makes it more robust to errors in the source genotypes. I demonstrate the predictive power of my local phylogeny model using a novel method for genome-wide genotype imputation. This imputation method achieves very high accuracy - on the order of the accuracy rate in the sequencing technology - by imputing genotypes in regions of shared inheritance based on my local phylogenies. Comparative genomic analysis within species can be greatly aided by appropriate visualization and analysis tools. I developed a framework for web-based visualization and analysis of multiple individuals within a species, with my model of local phylogeny providing the underlying structure. 
I will describe the utility of these tools and the applications for which they have found widespread use.
Doctor of Philosophy
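The thesis abstract above describes partitioning the genome into a minimal set of intervals within which there is no evidence of recombination. A minimal sketch of one way such a partition could be computed follows. It assumes phased, biallelic haplotypes encoded as strings of '0' and '1', and it assumes the classical four-gamete test as the criterion for "evidence of recombination"; the abstract does not say which criterion or algorithm the thesis actually uses. The function names are hypothetical.

```python
def four_gamete_conflict(site_a, site_b):
    """Return True if two biallelic sites show all four gametes (00, 01, 10, 11)."""
    return len(set(zip(site_a, site_b))) == 4


def partition_intervals(haplotypes):
    """Greedy left-to-right partition of sites into compatible intervals.

    haplotypes: list of equal-length strings over '0'/'1', one per chromosome.
    Returns a list of (start, end) site-index pairs, end inclusive, such that
    no pair of sites inside an interval fails the four-gamete test.
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    if n_sites == 0:
        return []
    # One column per site: the allele each haplotype carries at that site.
    columns = [[h[j] for h in haplotypes] for j in range(n_sites)]

    intervals = []
    start = 0
    for j in range(1, n_sites):
        # Close the current interval as soon as site j conflicts with any
        # site already in it, then start a new interval at site j.
        if any(four_gamete_conflict(columns[i], columns[j]) for i in range(start, j)):
            intervals.append((start, j - 1))
            start = j
    intervals.append((start, n_sites - 1))
    return intervals


if __name__ == "__main__":
    haps = ["0000", "0011", "1000", "1110"]
    print(partition_intervals(haps))
```

Because any sub-interval of a compatible interval is itself compatible, extending each interval as far right as possible gives the fewest contiguous intervals under this criterion. Handling heterozygous genotypes or genotyping errors, which the abstract mentions as extensions, is outside this sketch.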