26 research outputs found
Fast Algorithms for Large-Scale Phylogenetic Reconstruction
One of the most fundamental computational problems in biology is that of inferring evolutionary histories of groups of species from sequence data. Such evolutionary histories, known as phylogenies are usually represented as binary trees where leaves represent extant species, whereas internal nodes represent their shared ancestors. As the amount of sequence data available to biologists increases, very fast phylogenetic reconstruction algorithms are becoming necessary. Currently, large sequence alignments can contain up to hundreds of thousands of sequences, making traditional methods, such as Neighbor Joining, computationally prohibitive. To address this problem, we have developed three novel fast phylogenetic algorithms.
The first algorithm, QTree, is a quartet-based heuristic that runs in O(n log n) time. It is based on a theoretical algorithm that reconstructs the correct tree, with high probability, assuming every quartet is inferred correctly with constant probability. The core of our algorithm is a balanced search tree structure that enables us to locate an edge in the tree in O(log n) time. Our algorithm is several times faster than all the current methods, while its accuracy approaches that of Neighbour Joining.
The second algorithm, LSHTree, is the first sub-quadratic time algorithm with theoretical performance guarantees under a Markov model of sequence evolution. Our new algorithm runs in O(n^{1+γ(g)} log^2 n) time, where γ is an increasing function of an upper bound on the mutation rate along any branch in the phylogeny, and γ(g) < 1 for all g. For phylogenies with very short branches, the running time of our algorithm is close to linear. In experiments, our prototype implementation was more accurate than the current fast algorithms, while being comparably fast.
In the final part of this thesis, we apply the algorithmic framework behind LSHTree to the problem of placing large numbers of short sequence reads onto a fixed phylogenetic tree. Our initial results in this area are promising, but there are still many challenges to be resolved
Rapidly Computing the Phylogenetic Transfer Index
Given trees T and T_o on the same taxon set, the transfer index phi(b,T_o) is the number of taxa that need to be ignored so that the bipartition induced by branch b in T is equal to some bipartition in T_o. Recently, Lemoine et al. [Lemoine et al., 2018] used the transfer index to design a novel bootstrap analysis technique that improves on Felsenstein\u27s bootstrap on large, noisy data sets. In this work, we propose an algorithm that computes the transfer index for all branches b in T in O(n log^3 n) time, which improves upon the current O(n^2)-time algorithm by Lin, Rajan and Moret [Lin et al., 2012]. Our implementation is able to process pairs of trees with hundreds of thousands of taxa in minutes and considerably speeds up the method of Lemoine et al. on large data sets. We believe our algorithm can be useful for comparing large phylogenies, especially when some taxa are misplaced (e.g. due to horizontal gene transfer, recombination, or reconstruction errors)
New decoding algorithms for Hidden Markov Models using distance measures on labellings
<p>Abstract</p> <p>Background</p> <p>Existing hidden Markov model decoding algorithms do not focus on approximately identifying the sequence feature boundaries.</p> <p>Results</p> <p>We give a set of algorithms to compute the conditional probability of all labellings "near" a reference labelling <it>λ </it>for a sequence <it>y </it>for a variety of definitions of "near". In addition, we give optimization algorithms to find the best labelling for a sequence in the robust sense of having all of its feature boundaries nearly correct. Natural problems in this domain are <it>NP</it>-hard to optimize. For membrane proteins, our algorithms find the approximate topology of such proteins with comparable success to existing programs, while being substantially more accurate in estimating the positions of transmembrane helix boundaries.</p> <p>Conclusion</p> <p>More robust HMM decoding may allow for better analysis of sequence features, in reasonable runtimes.</p
Studia z Dziejów Państwa i Prawa Polskiego T. XXI Badania nad rozwojem instytucji politycznych i prawnych
Tytuł finansowany przez: Krakowską Akademię im. Andrzeja Frycza Modrzewskiego, Uniwersytet im. Marii Curie-Skłodowskiej w Lublinie, Uniwersytet Gdański, Uniwersytet Warszawsk
Computing the probability of gene trees concordant with the species tree in the multispecies coalescent
International audienceThe multispecies coalescent process models the genealogical relationships of genes sampled from several species, enabling useful predictions about phenomena such as the discordance between a gene tree and the species phylogeny due to incomplete lineage sorting. Conversely, knowledge of large collections of gene trees can inform us about several aspects of the species phylogeny, such as its topology and ancestral population sizes. A fundamental open problem in this context is how to e ciently compute the probability of a gene tree topology, given the species phylogeny. Although a number of algorithms for this task have been proposed, they either produce approximate results, or, when they are exact, they do not scale to large data sets. In this paper, we present some progress towards exact and e cient computation of the probability of a gene tree topology. We provide a new algorithm that, given a species tree and the number of genes sampled for each species, calculates the probability that the gene tree topology will be concordant with the species tree. Moreover, we provide an algorithm that computes the probability of any specific gene tree topology concordant with the species tree. Both algorithms run in polynomial time and have been implemented in Python. Experiments show that they are able to analyse data sets where thousands of genes are sampled in a matter of minutes to hours