19 research outputs found

    Circumstances in which parsimony but not compatibility will be provably misleading

    Phylogenetic methods typically rely on an appropriate model of how data evolved in order to infer an accurate phylogenetic tree. For molecular data, standard statistical methods have provided an effective strategy for extracting phylogenetic information from aligned sequence data when each site (character) is subject to a common process. However, for other types of data (e.g. morphological data), characters can be too ambiguous, homoplastic or saturated to develop models that are effective at capturing the underlying process of change. To address this, we examine the properties of a classic but neglected method for inferring splits in an underlying tree, namely, maximum compatibility. By adopting a simple and extreme model in which each character either fits perfectly on some tree or is entirely random (but it is not known which class any character belongs to), we are able to derive exact and explicit formulae regarding the performance of maximum compatibility. We show that this method is able to identify a set of non-trivial homoplasy-free characters when the number n of taxa is large, even when the number of random characters is large. By contrast, we show that a method that makes more uniform use of all the data, maximum parsimony, can provably estimate trees in which none of the original homoplasy-free characters support splits. Comment: 37 pages, 2 figures
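
To make the notion of compatibility concrete, here is a minimal sketch for binary characters only. It relies on the standard fact that two binary characters fit on a common tree without homoplasy exactly when at most three of the four state pairs occur across the taxa, and it finds a largest pairwise-compatible subset by brute force. The function names and the toy character matrix are illustrative assumptions, not taken from the paper, and the exhaustive search is only practical for a handful of characters.

```python
from itertools import combinations

def compatible(c1, c2):
    """Two binary characters are compatible iff at most three of the four
    state pairs (0,0), (0,1), (1,0), (1,1) occur across the taxa."""
    return len(set(zip(c1, c2))) < 4

def max_compatible_subset(chars):
    """Brute-force search for a largest pairwise-compatible subset.
    Returns the indices of the chosen characters (hypothetical helper)."""
    for size in range(len(chars), 0, -1):
        for subset in combinations(range(len(chars)), size):
            if all(compatible(chars[i], chars[j])
                   for i, j in combinations(subset, 2)):
                return list(subset)
    return []

# Toy data: one string per character, one position per taxon.
characters = [
    "110000",
    "111000",
    "000011",
    "101010",  # conflicts with the first character
]
best = max_compatible_subset(characters)
print(best)
```

In this toy matrix the last character shows all four state pairs against the first one, so it is the one left out of the compatible set.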

    Evolutionary trees: an integer multicommodity max-flow-min-cut theorem

    In biomathematics, the extensions of a leaf-colouration of a binary tree to the whole vertex set with the minimum number of colour-changing edges are extensively studied. Our paper generalizes the problem for trees; algorithms and a Menger-type theorem are presented. The LP dual of the problem is a multicommodity flow problem, for which a max-flow-min-cut theorem holds. The problem that we solve is an instance of the NP-hard multiway cut problem.
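
For the binary-tree case mentioned in this abstract, the minimum number of colour-changing edges can be computed with Fitch's classical bottom-up pass. The sketch below is that textbook algorithm, not the generalized method of the paper; the nested-tuple tree encoding, the function name and the example colours are assumptions made for illustration.

```python
def fitch(tree, colours):
    """Fitch's bottom-up pass on a rooted binary tree.

    tree    -- a leaf name (str) or a pair (left_subtree, right_subtree)
    colours -- dict mapping each leaf name to its colour
    Returns (candidate colours at this node, minimum changes below it).
    """
    if isinstance(tree, str):
        return {colours[tree]}, 0
    left_set, left_cost = fitch(tree[0], colours)
    right_set, right_cost = fitch(tree[1], colours)
    common = left_set & right_set
    if common:
        # The children can agree on a colour: no extra change is needed.
        return common, left_cost + right_cost
    # The children disagree: one colour-changing edge is unavoidable here.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical example tree and leaf colouration.
tree = ((("a", "b"), "c"), ("d", "e"))
colours = {"a": "red", "b": "blue", "c": "red", "d": "blue", "e": "blue"}
_, min_changes = fitch(tree, colours)
print(min_changes)
```

An unrooted binary tree can be handled by rooting it on any edge; the count of changes does not depend on that choice.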

    Adaptive Randomized Rounding in the Big Parsimony Problem

    Deep conservation of human protein tandem repeats within the eukaryotes

    Tandem repeats (TRs) are a major element of protein sequences in all domains of life. They are particularly abundant in mammals, where by conservative estimates one in three proteins contains a TR. High generation-scale duplication and deletion rates were reported for nucleic TR units. However, it is not known whether protein TR units can also be frequently lost or gained, providing a source of variation for rapid adaptation of protein function, or alternatively, tend to have conserved TR unit configurations over long evolutionary times. To obtain a systematic picture for protein TRs, we performed a proteome-wide analysis of the mode of evolution for human TRs. For this purpose, we propose a novel method for the detection of orthologous TRs based on circular profile hidden Markov models. For all detected TRs we reconstructed bi-species TR unit phylogenies across 61 eukaryotes ranging from human to yeast. Moreover, we performed additional analyses to correlate functional and structural annotations of human TRs with their mode of evolution. Surprisingly, we find that the vast majority of human TRs are ancient, with TR unit number and order preserved intact since distant speciation events. For example, ≥61% of all human TRs have been strongly conserved at least since the root of all mammals, approximately 300 Mya. Further, we find no human protein TR that shows evidence for strong recent duplications and deletions. The results are in contrast to the high generation-scale mutability of nucleic TRs. Presumably, most protein TRs fold into stable and conserved structures that are indispensable for the function of the TR-containing protein. All of our data and results are available for download from http://www.atgc-montpellier.fr/TRE

    Evolutionary patterns in Thamnochortus (Restionaceae) : a study of specification in the Cape floristic region

    Bibliography: pages 85-92. Patterns of speciation and potential evolutionary pressures and constraints were investigated in the genus Thamnochortus. Phenetic methods were used to define boundaries of species prior to cladistic analyses. Comparative techniques were employed to investigate aspects of dispersal biology and fire survival habit. Methods of historical biogeography were used to evaluate vicariance and dispersal hypotheses. The broader understanding of species evolution gained in such a comparative study is important in conservation of species or areas, forming a basis for further ecological and genetic predictions. The majority of Thamnochortus species have well-defined species limits; however, those of T. comptonii, T. platypteris and T. scabridus are more diffuse. For this species complex a matrix of 94 specimens, nine quantitative and sixteen qualitative characters was investigated, using cluster and ordination analyses, to define species boundaries. Thirty-four species of Thamnochortus, with three species of Rhodocoma as the outgroup, were used in the cladistic analysis. There were forty-three qualitative characters and ten quantitative characters. The number of species, height, reproductive output and geographic area were compared between sister lineages of seeding and resprouting species. In species classified as resprouters individuals survive fire by resprouting from the rhizome. In a post-fire environment seeding species recruit from seed and not by resprouting. Resprouters were significantly taller than seeders and covered a significantly larger distribution area. There was no significant difference in the amount of seed produced by seeding and resprouting lineages or in the geographic area covered by winged and keeled lineages. Correlated evolution tests indicated that wings of seeds evolved independently of the seeding condition, although the probability of wings evolving randomly was low.
The evolution of keels was significantly associated with a switch to resprouting. There are few distinct ecological differences between the seeding and resprouting habits in soil type or rainfall; however, the inference is that resprouters do occupy habitats in higher rainfall areas than the sister seeders. Biogeographic analysis of species distributions, using cluster methods with a Jaccard similarity coefficient, defined four phytogeographic areas which were considered to be areas of endemism. A concentric ring method recognised narrow areas of endemism and illustrated the overlap of species distributions between areas. The defined areas of endemism and similarity were used in general area cladograms to determine area relationships. The primary differentiations on the general area cladogram of areas of similarity distinguished a summer rainfall region (south coast) from a winter rainfall region (south Western Cape extending up the west coast). Within the winter rainfall region there is separation into a mesic (Cape Peninsula and south western mountain range) and an arid region (Cedarberg and Koue Bokkeveld). This analysis of Thamnochortus gives the first indication that the primary differentiation was between summer and winter rainfall, followed by the differentiation of the winter rainfall region into mesic and arid areas. Comparison within clades of distribution and habitat profiles indicated that, where distributions of closely related species overlap, there is niche differentiation in flowering time and substrate texture. Fire survival habit does not appear to have influenced speciation in Thamnochortus. There is, however, an evolutionary relationship between fire survival habit and female outer tepal specialization. Evidence from the general area cladogram indicates that speciation patterns in Thamnochortus may have been influenced by changes in rainfall in the Miocene.
Habitat profiles of sister species indicate that alterations in flowering time and substrate texture are key factors in ecological differentiation of species.

    Three mathematical issues in reconstructing ancestral genome

    Ph.D., Doctor of Philosophy

    Postprocessing phylogenies

    More and more phylogenetic trees are generated, and it frequently occurs that the inferred relationships contradict each other. In this case, tools are needed that evaluate the amount of difference between two trees, extract the congruencies of two trees, and combine multiple trees by minimizing the incongruencies. These tools are summarized by the term "phylogenetic postprocessing". In this thesis, two aspects of phylogenetic postprocessing are investigated in detail. First, tree distance computations evaluate the amount of difference between two trees. Most measures only take the topological information into account. There are a few measures that additionally focus on the branch lengths of the trees. One of these is the length of the shortest path in the space of weighted trees, also known as the geodesic distance. Here, an exact, but exponential-time, algorithm to compute the geodesic distance is presented.
Comparisons with its approximations show that there is a particular path that approximates the geodesic distance well and that can be computed in linear time. Phylogenetic trees can also be tested for being statistically similar or different. Then a topological distance measure can be used as a test statistic where the associated p-value is computed under a null distribution of trees. Discrete tests must ensure that the size of the test is conservative, i.e. the size must not exceed the significance level. We present one example where a test has to be modified to ensure this property. Second, gene trees on overlapping taxon sets can be combined into a so-called supertree. Another possibility is to combine the gene alignments directly, namely, to concatenate the gene alignments into a superalignment and to reconstruct a phylogeny from this long alignment. It is also possible to combine the data at a level between superalignment and supertree methods. Simulations of gene alignments along model gene trees allow for the comparison of methods from all three levels. We investigate different settings, e.g. complete or overlapping taxon sets, equal or different substitution parameters or different gene topologies. The results show a good performance of matrix representation methods compared to other supertree and medium-level methods. Furthermore, superalignment is well applicable in the case of differing parameters between genes but is problematic when a high level of incongruence is present among the true gene trees. In addition to the practical evaluation of supertree methods, theoretical and algorithmic aspects are of interest. Therefore we study different null models underlying supertree reconstruction. We find only the distribution of equally likely splits to behave in an appropriate way if little information is present. In contrast, the distribution of equally likely trees inserts a tree shape bias in split-based supertree methods.
This bias can be traced back to the unequal split distribution in the null model. Finally, a supertree can also be defined by minimizing the total distance to the trees in the set, i.e. as a median tree. The majority-rule consensus is described as a median tree method for trees on the same taxon set. For trees on overlapping taxon sets, however, different specifications can be used, namely MR(-)supertrees and MR(+)supertrees. We present algorithms to compute the respective distances in the matrix representation framework. Applying their implementation to simulated data sets shows a clearly better performance of MR(-) compared to MR(+). This discrepancy is likely to trace back to a tree shape bias in MR(+). To conclude, we see that the two aspects of phylogenetic postprocessing, tree distances and tree combination methods, are not independent. Instead, they are linked by the definition of the median tree. Thus our understanding of tree distances influences data combination methods and vice versa.
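
The link between a topological tree distance and the majority-rule consensus can be illustrated with a small sketch for rooted trees on the same taxon set. It uses clusters (the leaf sets below internal nodes) in place of splits, computes a Robinson-Foulds style distance as the size of the symmetric difference of two cluster sets, and keeps the clusters found in more than half of the input trees. This is a simplified textbook illustration, not the thesis's algorithms for geodesic distances or MR supertrees; the nested-tuple encoding and all names are assumptions.

```python
from collections import Counter

def clusters(tree):
    """Return (leaf set, clusters of internal nodes) for a rooted tree
    written as nested tuples with string leaf names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found

def nontrivial_clusters(tree):
    """Clusters of a tree without the trivial cluster of all taxa."""
    leaves, found = clusters(tree)
    found.discard(leaves)
    return found

def rf_distance(t1, t2):
    """Number of clusters present in exactly one of the two trees."""
    return len(nontrivial_clusters(t1) ^ nontrivial_clusters(t2))

def majority_clusters(trees):
    """Clusters occurring in more than half of the input trees."""
    counts = Counter()
    for tree in trees:
        counts.update(nontrivial_clusters(tree))
    return {c for c, k in counts.items() if k > len(trees) / 2}

# Hypothetical input trees on the taxa a, b, c, d.
t1 = (("a", "b"), ("c", "d"))
t2 = ((("a", "b"), "c"), "d")
t3 = (("a", "c"), ("b", "d"))
print(rf_distance(t1, t2))
print(majority_clusters([t1, t2, t3]))
```

The clusters kept by `majority_clusters` are always mutually nested or disjoint, so they define a tree, which is the majority-rule consensus in this rooted setting.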

    Fast Algorithms for Large-Scale Phylogenetic Reconstruction

    One of the most fundamental computational problems in biology is that of inferring evolutionary histories of groups of species from sequence data. Such evolutionary histories, known as phylogenies, are usually represented as binary trees where leaves represent extant species, whereas internal nodes represent their shared ancestors. As the amount of sequence data available to biologists increases, very fast phylogenetic reconstruction algorithms are becoming necessary. Currently, large sequence alignments can contain up to hundreds of thousands of sequences, making traditional methods, such as Neighbor Joining, computationally prohibitive. To address this problem, we have developed three novel fast phylogenetic algorithms. The first algorithm, QTree, is a quartet-based heuristic that runs in O(n log n) time. It is based on a theoretical algorithm that reconstructs the correct tree, with high probability, assuming every quartet is inferred correctly with constant probability. The core of our algorithm is a balanced search tree structure that enables us to locate an edge in the tree in O(log n) time. Our algorithm is several times faster than all the current methods, while its accuracy approaches that of Neighbor Joining. The second algorithm, LSHTree, is the first sub-quadratic time algorithm with theoretical performance guarantees under a Markov model of sequence evolution. Our new algorithm runs in O(n^{1+γ(g)} log^2 n) time, where γ is an increasing function of an upper bound on the mutation rate along any branch in the phylogeny, and γ(g) < 1 for all g. For phylogenies with very short branches, the running time of our algorithm is close to linear. In experiments, our prototype implementation was more accurate than the current fast algorithms, while being comparably fast. In the final part of this thesis, we apply the algorithmic framework behind LSHTree to the problem of placing large numbers of short sequence reads onto a fixed phylogenetic tree.
Our initial results in this area are promising, but there are still many challenges to be resolved.