254 research outputs found

    Algorithms For Haplotype Inference And Block Partitioning

    The completion of the human genome project in 2003 paved the way for studies to better understand and catalog variation in the human genome. The International HapMap Project was started in 2002 with the aim of identifying genetic variation in the human genome and studying the distribution of genetic variation across populations of individuals. The information collected by the HapMap project will enable researchers to associate genetic variations with phenotypic variations. Single Nucleotide Polymorphisms (SNPs) are loci in the genome where two individuals differ in a single base. It is estimated that there are approximately ten million SNPs in the human genome. These ten million SNPs are not completely independent of each other - blocks (contiguous regions) of neighboring SNPs on the same chromosome are inherited together. The pattern of SNPs on a block of the chromosome is called a haplotype. Each block might contain a large number of SNPs, but a small subset of these SNPs is sufficient to uniquely identify each haplotype in the block. The haplotype map or HapMap is a map of these haplotype blocks. Haplotypes, rather than individual SNP alleles, are expected to affect a disease phenotype. The human genome is diploid, meaning that in each cell there are two copies of each chromosome - i.e., each individual has two haplotypes in any region of the chromosome. With the current technology, the cost of empirically collecting haplotype data is prohibitive. Therefore, unordered bi-allelic genotype data is collected experimentally. The genotype data gives the two alleles at each SNP locus in an individual, but does not give information about which allele is on which copy of the chromosome. This necessitates computational techniques for inferring haplotypes from genotype data. This computational problem is called the haplotype inference problem. Many statistical approaches have been developed for the haplotype inference problem.
Some of these statistical methods have been shown to be reasonably accurate on real genotype data. However, these techniques are very computation-intensive. With the International HapMap Project collecting information from nearly 10 million SNPs, and with association studies involving thousands of individuals being undertaken, there is a need for more efficient methods for haplotype inference. This dissertation is an effort to develop efficient perfect-phylogeny-based combinatorial algorithms for haplotype inference. The perfect phylogeny haplotyping (PPH) problem is to derive a set of haplotypes for a given set of genotypes with the condition that the haplotypes describe a perfect phylogeny. The perfect phylogeny approach to haplotype inference is applicable to the human genome due to the block structure of the human genome. An important contribution of this dissertation is an optimal O(nm) time algorithm for the PPH problem, where n is the number of genotypes and m is the number of SNPs involved. The complexity of the earlier algorithms for this problem was O(nm^2). The O(nm) complexity was achieved by applying some transformations on the input data and by making use of the FlexTree data structure, developed as part of this dissertation work, which represents all the possible PPH solutions for a given set of genotypes. Real genotype data does not always admit a perfect phylogeny, even within a block of the human genome. Therefore, it is necessary to extend the perfect phylogeny approach to accommodate deviations from perfect phylogeny. Deviations from perfect phylogeny might occur because of recombination events and repeated or back mutations (also referred to as homoplasy events). Another contribution of this dissertation is a set of fixed-parameter tractable algorithms for constructing near-perfect phylogenies with homoplasy events.
For the problem of constructing a near-perfect phylogeny with q homoplasy events, the algorithm presented here takes O(nm^2+m^(n+m)) time. Empirical analysis on simulated data shows that this algorithm produces more accurate results than PHASE (a popular haplotype inference program), while being approximately 1000 times faster than PHASE. Another important problem when dealing with real genotype or haplotype data is the presence of missing entries. The Incomplete Perfect Phylogeny (IPP) problem is to construct a perfect phylogeny on a set of haplotypes with missing entries. The Incomplete Perfect Phylogeny Haplotyping (IPPH) problem is to construct a perfect phylogeny on a set of genotypes with missing entries. Both the IPP and IPPH problems have been shown to be NP-hard. The earlier approaches for both of these problems dealt with restricted versions of the problem, where the root is either available or can be trivially reconstructed from the data, or certain assumptions were made about the data. We make some novel observations about these problems, and present efficient algorithms for unrestricted versions of these problems. The algorithms have worst-case exponential time complexity, but have been shown to be very fast on practical instances of the problem.
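
To make the haplotype inference problem and the perfect phylogeny condition concrete, here is a minimal brute-force sketch. It is not the O(nm) algorithm or the FlexTree structure from the dissertation. It assumes a hypothetical encoding in which each genotype site is 0 or 1 when homozygous and 2 when heterozygous, and it uses the standard four-gamete test for binary sites as the perfect phylogeny check. All function names are illustrative.

```python
from itertools import combinations, product


def compatible_pairs(genotype):
    """Return every unordered haplotype pair that could produce `genotype`.

    Assumed encoding: 0 or 1 = homozygous for that allele, 2 = heterozygous.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het_sites, bits):
            h1[site], h2[site] = bit, 1 - bit
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test: fail if any two sites show all of 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    for a, b in combinations(range(n_sites), 2):
        if len({(h[a], h[b]) for h in haplotypes}) == 4:
            return False
    return True


def brute_force_pph(genotypes):
    """Try every resolution of every genotype; exponential, for illustration."""
    options = [sorted(compatible_pairs(g)) for g in genotypes]
    for choice in product(*options):
        haplotypes = [h for pair in choice for h in pair]
        if admits_perfect_phylogeny(haplotypes):
            return list(choice)
    return None
```

For example, `brute_force_pph([(2, 2), (0, 2), (1, 0)])` has to resolve the first genotype as the pair `(0, 1)` and `(1, 0)`, because the alternative `(0, 0)` and `(1, 1)` would introduce all four gametes. The search space doubles with each additional heterozygous site, which is the cost that efficient combinatorial PPH algorithms avoid.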

    Circumstances in which parsimony but not compatibility will be provably misleading

    Phylogenetic methods typically rely on an appropriate model of how data evolved in order to infer an accurate phylogenetic tree. For molecular data, standard statistical methods have provided an effective strategy for extracting phylogenetic information from aligned sequence data when each site (character) is subject to a common process. However, for other types of data (e.g. morphological data), characters can be too ambiguous, homoplastic or saturated to develop models that are effective at capturing the underlying process of change. To address this, we examine the properties of a classic but neglected method for inferring splits in an underlying tree, namely, maximum compatibility. By adopting a simple and extreme model in which each character either fits perfectly on some tree, or is entirely random (but it is not known which class any character belongs to), we are able to derive exact and explicit formulae regarding the performance of maximum compatibility. We show that this method is able to identify a set of non-trivial homoplasy-free characters, when the number n of taxa is large, even when the number of random characters is large. By contrast, we show that a method that makes more uniform use of all the data, maximum parsimony, can provably estimate trees in which none of the original homoplasy-free characters support splits. Comment: 37 pages, 2 figures
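
As a rough illustration of what maximum compatibility computes, the sketch below finds a largest set of mutually compatible characters by exhaustive search. It assumes binary (two-state) characters, for which pairwise compatibility is enough to guarantee that the whole set fits one tree. The abstract does not restrict the number of states, so this is a simplification, and the function names are hypothetical.

```python
from itertools import combinations


def compatible(c1, c2):
    """Two binary characters are compatible unless the taxa jointly show
    all four state combinations 00, 01, 10, 11."""
    return len(set(zip(c1, c2))) < 4


def maximum_compatible_subset(characters):
    """Return indices of a largest pairwise-compatible set of characters.

    Exhaustive search from the largest subset size downward; exponential in
    the number of characters and intended only for tiny examples.
    """
    indices = range(len(characters))
    for size in range(len(characters), 0, -1):
        for subset in combinations(indices, size):
            if all(compatible(characters[i], characters[j])
                   for i, j in combinations(subset, 2)):
                return list(subset)
    return []


# Five taxa; characters 0, 1 and 3 fit one tree, character 2 is noise-like.
characters = [
    (1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 1, 0, 0),
]
print(maximum_compatible_subset(characters))
```

In this toy input the noise-like character conflicts with each of the other three, so it is the one left out of the selected set.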

    Systematics and cophylogenetics of toucans and their associated chewing lice

    Historically, comparisons of host and parasite phylogenies have concentrated on cospeciation. However, many of these comparisons have demonstrated that the phylogenies of hosts and parasites are seldom completely congruent, suggesting that phenomena other than cospeciation play an important role in the evolution of host-parasite assemblages. Other coevolutionary phenomena, such as host switching, parasite duplication (speciation on the host), sorting (extinction), and failure to speciate can also influence host-parasite assemblages. In this dissertation I explore several aspects of the evolutionary history of Ramphastos toucans and their ectoparasitic chewing lice using molecular phylogenetic and cophylogenetic reconstructions. First, using mitochondrial DNA sequences, I reconstructed the phylogeny of the Ramphastos toucans. I used this phylogeny to assess whether the striking similarity in plumage and bare-part coloration of sympatric Ramphastos is due to convergence or shared ancestry. Ancestral character state reconstructions indicate that at least half of the instances of similarity in plumage and bare-part coloration between sympatric Ramphastos are due to homoplasy. Second, using mitochondrial and nuclear protein-coding DNA sequences, I reconstructed the phylogeny of ectoparasitic toucan chewing lice in the Austrophilopterus cancellosus subspecies complex, and compared this phylogeny to the phylogeny of the hosts to reconstruct the history of coevolutionary events in this host-parasite assemblage. Three salient findings emerged. (1) Reconstructions of host and louse phylogenies indicate that they do not branch in parallel and that their cophylogenetic history shows little or no significant cospeciation. (2) Members of monophyletic Austrophilopterus toucan louse lineages are not necessarily restricted to monophyletic host lineages. Often, closely related lice are found on more distantly related, but sympatric, toucan hosts.
(3) The geographic distribution of the hosts apparently plays a role in the speciation of these lice. These results suggest that for some louse lineages, biogeography may be more important than host associations in structuring louse populations and species. This is particularly true in cases where host life history (e.g. hole-nesting) or parasite life history (e.g. phoresis) might promote frequent host switching events between syntopic host species. These findings highlight the importance of integrating biogeographic information into cophylogenetic studies.

    The Binary Perfect Phylogeny with Persistent characters

    The binary perfect phylogeny model is too restrictive to model biological events such as back mutations. In this paper we consider a natural generalization of the model that allows a special type of back mutation. We investigate the problem of reconstructing a near-perfect phylogeny over a binary set of characters where characters are persistent: characters can be gained and lost at most once. Based on this notion, we define the problem of the Persistent Perfect Phylogeny (referred to as P-PP). We restate the P-PP problem as a special case of the Incomplete Directed Perfect Phylogeny, called Incomplete Perfect Phylogeny with Persistent Completion (referred to as IP-PP), where the instance is an incomplete binary matrix M having some missing entries, denoted by the symbol ?, that must be determined (or completed) as 0 or 1 so that M admits a binary perfect phylogeny. We show that the IP-PP problem can be reduced to a problem over an edge-colored graph since the completion of each column of the input matrix can be represented by a graph operation. Based on this graph formulation, we develop an exact algorithm for solving the P-PP problem that is exponential in the number of characters and polynomial in the number of species. Comment: 13 pages, 3 figures
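
The completion task described above can be sketched by brute force: try every 0/1 assignment of the ? entries and keep the first one that passes a perfect phylogeny check. This is only an illustration under stated assumptions. It treats the phylogeny as directed with an all-zero root (so a pair of columns conflicts when rows exhibit all of 01, 10 and 11), it ignores any additional constraints that persistence places on which completions are allowed, and it is not the graph-based algorithm of the paper. Function names are hypothetical.

```python
from itertools import combinations, product

CONFLICT = {(0, 1), (1, 0), (1, 1)}


def is_directed_perfect_phylogeny(matrix):
    """Check a complete binary matrix (rows = species, columns = characters).

    Assumes an all-zero root: two columns conflict when some rows show
    (0, 1), some show (1, 0) and some show (1, 1).
    """
    n_cols = len(matrix[0])
    for a, b in combinations(range(n_cols), 2):
        seen = {(row[a], row[b]) for row in matrix}
        if CONFLICT <= seen:
            return False
    return True


def complete(matrix):
    """Fill every "?" with 0 or 1 so the result passes the check, or return
    None. Exponential in the number of missing entries."""
    holes = [(i, j) for i, row in enumerate(matrix)
             for j, value in enumerate(row) if value == "?"]
    for bits in product((0, 1), repeat=len(holes)):
        filled = [list(row) for row in matrix]
        for (i, j), bit in zip(holes, bits):
            filled[i][j] = bit
        if is_directed_perfect_phylogeny(filled):
            return filled
    return None
```

For the input `[[1, 1], [1, "?"], [0, 1]]`, setting the missing entry to 0 would create the conflicting pattern, so the only valid completion sets it to 1.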

    Widespread Discordance of Gene Trees with Species Tree in Drosophila: Evidence for Incomplete Lineage Sorting

    The phylogenetic relationship of the now fully sequenced species Drosophila erecta and D. yakuba with respect to the D. melanogaster species complex has been a subject of controversy. All three possible groupings of the species have been reported in the past, though recent multi-gene studies suggest that D. erecta and D. yakuba are sister species. Using the whole genomes of each of these species as well as the four other fully sequenced species in the subgenus Sophophora, we set out to investigate the placement of D. erecta and D. yakuba in the D. melanogaster species group and to understand the cause of the past incongruence. Though we find that the phylogeny grouping D. erecta and D. yakuba together is the best supported, we also find widespread incongruence in nucleotide and amino acid substitutions, insertions and deletions, and gene trees. The time inferred to span the two key speciation events is short enough that under the coalescent model, the incongruence could be the result of incomplete lineage sorting. Consistent with the lineage-sorting hypothesis, substitutions supporting the same tree were spatially clustered. Support for the different trees was found to be linked to recombination such that adjacent genes support the same tree most often in regions of low recombination and substitutions supporting the same tree are most enriched roughly on the same scale as linkage disequilibrium, also consistent with lineage sorting. The incongruence was found to be statistically significant and robust to model and species choice. No systematic biases were found. We conclude that phylogenetic incongruence in the D. melanogaster species complex is the result, at least in part, of incomplete lineage sorting. Incomplete lineage sorting will likely cause phylogenetic incongruence in many comparative genomics datasets. 
Methods to infer the correct species tree, the history of every base in the genome, and comparative methods that control for and/or utilize this information will be valuable advancements for the field of comparative genomics.
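
The link between a short interval separating two speciation events and incomplete lineage sorting can be illustrated with the standard three-species coalescent result: the probability that a gene tree disagrees with the species tree is (2/3)·exp(-T), where T is the length of the internal branch in coalescent units. The sketch below assumes a diploid population of constant effective size, so T = t / (2N) for t generations. The parameter values are hypothetical and are not taken from the study.

```python
import math


def discordance_probability(t_generations, n_effective):
    """Probability that a three-species gene tree disagrees with the species
    tree under the coalescent, assuming a constant diploid effective size.

    T = t / (2N) is the internal branch length in coalescent units.
    """
    branch_length = t_generations / (2.0 * n_effective)
    return (2.0 / 3.0) * math.exp(-branch_length)


# Hypothetical values: shorter intervals give more discordance.
for t in (0, 1_000, 10_000, 100_000):
    print(t, round(discordance_probability(t, n_effective=10_000), 3))
```

As the interval shrinks toward zero the three topologies become equally likely, so two thirds of gene trees are discordant; as it grows, discordance decays exponentially.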

    Classification: More than Just Branching Patterns of Evolution

    The past 35 years in biological systematics have been a time of remarkable philosophical and methodological developments. For nearly a century after Darwin's Origin of Species, systematists worked to understand the diversity of nature based on evolutionary relationships. Numerous concepts were presented and elaborated upon, such as homology, parallelism, divergence, primitiveness and advancedness, cladogenesis and anagenesis. Classifications were based solidly on phylogenetic concepts; they were avowedly monophyletic. Phenetics emphasized the immense challenges represented by phylogeny reconstruction and advised against basing classifications upon it. Pheneticists forced reevaluation of all previous classificatory efforts, and objectivity and repeatability in both grouping and ranking were stressed. The concept of character state was developed, and numerous debates focused on other concepts, such as unit character, homology, similarity, and distance. The simultaneous availability of computers allowed phenetics to explore new limits. Despite numerous positive aspects of phenetics, the near absence of evolutionary insights led eventually to cladistics. Drawing directly from phenetics and from the Hennigian philosophical school, cladistics evolved as an explicit means of deriving branching patterns of phylogeny upon which classifications might be based. Two decades of cladistics have given us: refined arguments on homology and the evolutionary content of characters and states, views of classifications as testable hypotheses, and computer algorithms for constructing branching patterns of evolution. In contrast to the phenetic movement, which was noteworthy for seeking newer concepts and methods, even including determining evolutionary relationships (which led eventually to numerical cladistics), many cladists have solidified their approaches based on parsimony, outgroups, and holophyly.
Instead of looking for newer ways to represent phylogeny, some cladists have attempted to use branching patterns: (1) as a strict basis for biological classification and nomenclature and (2) to explain the origin of biological diversity even down to the populational level. This paper argues that cladistics is inappropriate to both these goals due to: (1) inability of branching patterns to reveal all significant dimensions of phylogeny; (2) acknowledged patterns of reticulate evolution, especially in flowering plants; (3) documented nonparsimonious pathways of evolution; and (4) nondichotomous distribution of genetic variation within populations. New concepts and methods of reconstructing phylogeny and developing classifications must be sought. Most important is incorporation of genetic-based evolutionary divergence within lineages for purposes of grouping and ranking.

    The comparative genomics and complex population history of Papio baboons
