
Compatibility of phylogenetic trees is the most important concept underlying widely used methods for assessing the agreement of different phylogenetic trees with overlapping taxa and combining them into common supertrees to reveal the tree of life. The notion of ancestral compatibility of phylogenetic trees with nested taxa was introduced by Semple et al. in 2004. In this paper we analyze in detail the meaning of this compatibility from the points of view of the local structure of the trees, of the existence of embeddings into a common supertree, and of the joint properties of their cluster representations. Our analysis leads to a very simple polynomial-time algorithm for testing this compatibility, which we have implemented and made freely available for download from the BioPerl collection of Perl modules for computational biology. Comment: Submitte
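As a rough illustration of the cluster-representation viewpoint, the sketch below checks the classical notion of compatibility (not the ancestral one studied in the paper) for rooted trees on the same taxon set: such trees are compatible exactly when the union of their clusters forms a hierarchy, that is, any two clusters are nested or disjoint. This is not the paper's algorithm or the BioPerl implementation; the nested-tuple tree encoding and the function names are assumptions made for the example.

```python
from itertools import combinations


def clusters(tree):
    """Return the clusters (frozensets of leaf labels) of a rooted tree.

    Hypothetical encoding: a tree is a leaf label (str) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            cluster = frozenset([node])
        else:
            cluster = frozenset().union(*(leaves(child) for child in node))
        found.add(cluster)
        return cluster

    leaves(tree)
    return found


def is_hierarchy(cluster_sets):
    """True if every two clusters in the union are nested or disjoint."""
    all_clusters = set().union(*cluster_sets)
    return all(
        a <= b or b <= a or a.isdisjoint(b)
        for a, b in combinations(all_clusters, 2)
    )


def compatible(*trees):
    """Classical compatibility test for rooted trees on the same leaf set."""
    return is_hierarchy([clusters(t) for t in trees])
```

For instance, `compatible((("a", "b"), "c", "d"), ("a", "b", ("c", "d")))` holds because the clusters {a, b} and {c, d} are disjoint, while replacing the second tree with `(("a", "c"), "b", "d")` introduces the overlapping clusters {a, b} and {a, c}.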

arXiv.org e-Print Archive

CiteSeerX

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Maximum agreement and compatible supertrees

Authors: Berry Vincent, Nicolas François
Publication venue: Elsevier B.V.

Abstract: Given a set of leaf-labelled trees with identical leaf sets, the MAST problem, respectively MCT problem, consists of finding a largest subset of leaves such that all input trees restricted to these leaves are isomorphic, respectively compatible. In this paper, we propose extensions of these problems to the context of supertree inference, where input trees have non-identical leaf sets. This situation is of particular interest in phylogenetics. The resulting problems are called SMAST and SMCT. A sufficient condition is given that identifies cases where these problems can be solved by resorting to MAST and MCT as subproblems. This condition is met, for instance, when only two input trees are considered. Then we give algorithms for SMAST and SMCT that benefit from the link with the subtree problems. These algorithms run in time linear in the time needed to solve MAST, respectively MCT, on an instance of the same or smaller size. It is shown that arbitrary instances of SMAST and SMCT can be turned in polynomial time into instances composed of trees with a bounded number of leaves. SMAST is shown to be W[2]-hard when the considered parameter is the number of input leaves that have to be removed to obtain the agreement of the input trees. A similar result holds for SMCT. Moreover, the corresponding optimization problems, that is the complements of SMAST and SMCT, cannot be approximated in polynomial time within any constant factor, unless P=NP. These results also hold when the input trees have a bounded number of leaves. The presented results apply to collections of both rooted and unrooted trees.

Elsevier - Publisher Connector

Reweaving the tapestry: a supertree of birds

Author: Davis Katie E.
Publication date: 01/01/2008

Supertrees are a useful method of constructing large-scale phylogenies by assembling numerous smaller phylogenies that have some, but not necessarily all, taxa in common. Birds are an obvious candidate for supertree construction as they are the most abundant land vertebrates on the planet and no comprehensive phylogeny of both extinct and extant species currently exists. In order to construct supertrees, primary analysis of characters is required. One such study, presented here, describes two new partial specimens belonging to the Primobucconidae from the Green River Formation of Wyoming (USA), which were assigned to the species Primobucco mcgrewi. Although incomplete, these specimens had preserved anatomical features not seen in other material. An attempt to further constrain their phylogenetic position was inconclusive, showing only that the Primobucconidae belong in a clade containing the extant Coraciiformes and related taxa. Over 700 such studies were used to construct a species-level supertree of Aves containing over 5000 taxa. The resulting tree shows the relationships between the main avian groups, with only a few novel clades, some of which can be explained by a lack of information regarding those taxa. The tree was constructed using a strict protocol which ensures robust, accurate and efficient data collection and processing; extending previous work by other authors. Before creating the species-level supertree the protocol was tested on the order Galliformes in order to determine the most efficient method of removing non-independent data. It was found that combining non-independent source trees via a “mini-supertree” analysis produced results more consistent with the input source data and, in addition, significantly reduced computational load. Another method for constructing large-scale trees is via a supermatrix, which is constructed from primary data collated into a single, large matrix. 
A molecular-only tree was constructed using both supertree and supermatrix methods, from the same data, again of the order Galliformes. Both methods performed equally well in producing trees that fit the source data. The two methods could be considered complementary rather than conflicting: the supertree took a long time to construct but was very quick to calculate, whereas the supermatrix was quicker to construct but took longer to calculate. Depending on the data at hand and the other factors involved, the choice of method appears, from this small study, to be of little consequence. Finally, an updated species-level supertree of the Dinosauria was also constructed and used to look at diversification rates in order to elucidate the “Cretaceous explosion of terrestrial life”. Results from this study show that this apparent burst in diversity at the end of the Cretaceous is a sampling artefact and that, in fact, dinosaurs show most of their major diversification shifts in the first third of their history.

Glasgow Theses Service

OpenGrey Repository

Polynomial supertree methods in phylogenomics: algorithms, simulations and software

Author: Brinkmeyer Malte
Publication date: 29/08/2013

One of the objectives in modern biology, especially phylogenetics, is to build larger clades of the Tree of Life. Large-scale phylogenetic analysis involves several serious challenges. The aim of this thesis is to contribute to some of the open problems in this context. In computational phylogenetics, supertree methods provide a way to reconstruct larger clades of the Tree of Life. We present a novel polynomial-time approach for the computation of supertrees called FlipCut supertree. Our method combines the computation of minimum cuts from graph-based methods with a matrix representation method, namely Minimum Flip Supertrees. Here, the input trees are encoded in a 0/1/?-matrix. We present a heuristic to search for a minimum set of 0/1-flips such that the resulting matrix admits a directed perfect phylogeny. In contrast to other polynomial-time approaches, our results can be interpreted in the sense that we try to minimize a global objective function, namely the number of flips in the input matrix. We extend our approach by using edge weights to weight the columns of the 0/1/?-matrix. In order to compare our new FlipCut supertree method with other recent polynomial supertree methods and matrix representation methods, we present a large-scale simulation study using two different data sets. Our findings illustrate the trade-off between accuracy and running time in supertree construction, as well as the pros and cons of different supertree approaches. Furthermore, we present EPoS, a modular software framework for phylogenetic analysis and visualization. It fills the gap between command-line-based algorithmic packages and visual tools without sufficient support for computational methods. By combining a powerful graphical user interface with a plugin system that allows simple integration of new algorithms, visualizations and data structures, we created a framework that is easy to use and to extend and that covers all important steps of a phylogenetic analysis.
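The 0/1/?-matrix encoding mentioned in the abstract can be sketched as follows. This covers only the encoding step, not the FlipCut heuristic, and it assumes the usual matrix representation for rooted trees: one column per internal node below the root of each input tree, with 1 for taxa inside that node's clade, 0 for taxa of the same tree outside it, and ? for taxa missing from the tree. The nested-tuple tree encoding and all function names are hypothetical.

```python
def leaf_set(tree):
    """Leaf labels of a tree given as a label (str) or a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def proper_clusters(tree):
    """Clades of the internal nodes below the root, in preorder."""
    result = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            result.append(leaf_set(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return result


def encode(trees):
    """Map each taxon to its row of the 0/1/?-matrix (one char per column)."""
    taxa = sorted(set().union(*(leaf_set(t) for t in trees)))
    columns = []
    for tree in trees:
        present = leaf_set(tree)
        for clade in proper_clusters(tree):
            columns.append(
                ["1" if x in clade else "0" if x in present else "?" for x in taxa]
            )
    return {x: "".join(col[i] for col in columns) for i, x in enumerate(taxa)}
```

With the two input trees `(("a", "b"), "c")` and `(("b", "c"), "d")`, the matrix has two columns and the rows `a: 1?`, `b: 11`, `c: 01`, `d: ?0`.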

Digitale Bibliothek Thüringen

Postprocessing phylogenies

Author: Kupczok Anne
Publication date: 01/01/2010

More and more phylogenetic trees are generated, and it frequently occurs that the inferred relationships contradict each other. In this case, tools are necessary which evaluate the amount of difference between two trees, extract the congruencies of two trees, and combine multiple trees by minimizing the incongruencies. These tools are summarized by the term “phylogenetic postprocessing”. In this thesis, two aspects of phylogenetic postprocessing are investigated in detail. First, tree distance computations evaluate the amount of difference between two trees. Most measures only take the topological information into account. However, branch lengths also carry information, for example as an estimate of the amount of difference between two sequences, and a few measures additionally take them into account. One of these is the length of the shortest path in the space of weighted trees, also known as the geodesic distance. Here, an exact, but exponential-time, algorithm to compute the geodesic distance is presented. 
Comparisons with its approximations show that there is a particular path that approximates the geodesic distance well and that can be computed in linear time. Phylogenetic trees can also be tested for being statistically similar or different. Then a topological distance measure can be used as a test statistic where the associated p-value is computed under a null distribution of trees. Discrete tests must ensure that the size of the test is conservative, i.e. the size must not exceed the significance level. We present one example where a test has to be modified to ensure this property. Second, gene trees on overlapping taxon sets can be combined into a so-called supertree. Another possibility is to combine the gene alignments directly, namely, to concatenate the gene alignments into a superalignment and to reconstruct a phylogeny from this long alignment. It is also possible to combine the data at a level between superalignment and supertree methods. Simulations of gene alignments along model gene trees allow for the comparison of methods from all three levels. We investigate different settings, e.g. complete or overlapping taxon sets, equal or different substitution parameters or different gene topologies. The results show a good performance of matrix representation methods compared to other supertree and medium-level methods. Furthermore, superalignment works well in the case of differing parameters between genes but is problematic when a high level of incongruence is present among the true gene trees. In addition to the practical evaluation of supertree methods, theoretical and algorithmic aspects are of interest. Therefore we study different null models underlying supertree reconstruction. We find that only the distribution of equally likely splits behaves in an appropriate way if little information is present. In contrast, the distribution of equally likely trees inserts a tree shape bias in split-based supertree methods. 
This bias can be traced back to the unequal split distribution in the null model. Finally, a supertree can also be defined by minimizing the total distance to the trees in the set, i.e. as a median tree. The majority-rule consensus is described as a median tree method for trees on the same taxon set. For trees on overlapping taxon sets, however, different specifications can be used, namely MR(-)supertrees and MR(+)supertrees. We present algorithms to compute the respective distances in the matrix representation framework. Applying their implementation to simulated data sets shows a clearly better performance of MR(-) compared to MR(+). This discrepancy can likely be traced back to a tree shape bias in MR(+). To conclude, we see that the two aspects of phylogenetic postprocessing, tree distances and tree combination methods, are not independent. Instead, they are linked by the definition of the median tree. Thus our understanding of tree distances influences data combination methods and vice versa.
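For trees on the same taxon set, the majority-rule consensus referred to in the abstract keeps exactly the clusters that occur in more than half of the input trees. The sketch below shows that selection step for rooted trees only; it does not implement the MR(-) or MR(+) supertree distances of the thesis, and the nested-tuple tree encoding and function names are assumptions for the example.

```python
from collections import Counter


def clusters(tree):
    """Return the clusters (frozensets of leaf labels) of a rooted tree.

    Hypothetical encoding: a tree is a leaf label (str) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            cluster = frozenset([node])
        else:
            cluster = frozenset().union(*(leaves(child) for child in node))
        found.add(cluster)
        return cluster

    leaves(tree)
    return found


def majority_rule_clusters(trees):
    """Clusters present in strictly more than half of the input trees."""
    counts = Counter()
    for tree in trees:
        # clusters() returns a set, so each cluster counts once per tree.
        counts.update(clusters(tree))
    return {c for c, n in counts.items() if 2 * n > len(trees)}
```

Because any two clusters that each occur in more than half of the trees must share at least one tree, the selected clusters are pairwise nested or disjoint and therefore define a tree.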

OTHES