11 research outputs found

    Phylogenetic inference using Hamiltonian Monte Carlo

    Ph.D. Thesis. Phylogenetics is the study of evolutionary structure, aiming to reconstruct the branching structure of speciation from a common ancestor. There are many methods of inferring the tree-like structure, from the most basic, physical traits (morphology), to analysing the distances between genetic code based on a predefined metric. For viruses, such a method is the best way to access their heredity. Bayesian inference enables us to learn a region of possible trees and alter the distribution of trees according to prior beliefs. The most common method of conducting Bayesian inference over the space of evolutionary trees, called Tree space (Billera et al., 2001), is Markov Chain Monte Carlo (MCMC). Tree space is big and exploration is slow; a modern technique for speeding up MCMC is Hamiltonian Monte Carlo (HMC), developed by Duane et al. (1987). We incorporate HMC into Tree space by creating our own algorithm: Cross-Orthant HMC (COrtHMC). Many methods of increasing HMC convergence speed have been developed, such as Riemannian Manifold HMC (RM-HMC) (Girolami et al., 2011). Where applicable, we adapted such methods to COrtHMC and then compared COrtHMC to pre-existing methods of phylogenetic inference and probabilistic path HMC (ppHMC) (Dinh et al., 2017). We found that all forms of COrtHMC, as well as ppHMC, perform similarly, but that the increased computational cost of using such HMC methods outweighs any benefit. EPSR
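
For readers unfamiliar with HMC, the following is a minimal sketch of a plain Euclidean HMC step with a leapfrog integrator. It is not the thesis's COrtHMC algorithm and does not operate on Tree space; the toy target (a standard normal), the function names `leapfrog` and `hmc_step`, and the step size and trajectory length are all assumptions chosen for illustration.

```python
import math
import random


def leapfrog(q, p, grad_U, step, n_steps):
    """Integrate Hamiltonian dynamics with the leapfrog scheme (unit mass)."""
    q, p = list(q), list(p)
    g = grad_U(q)
    # Initial half step for the momentum.
    p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]
    for i in range(n_steps):
        # Full step for the position.
        q = [qi + step * pi for qi, pi in zip(q, p)]
        g = grad_U(q)
        # Full momentum step, except after the last position update.
        if i < n_steps - 1:
            p = [pi - step * gi for pi, gi in zip(p, g)]
    # Final half step for the momentum.
    p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]
    return q, p


def hmc_step(q, U, grad_U, step, n_steps, rng):
    """One HMC transition: resample momentum, integrate, accept or reject."""
    p0 = [rng.gauss(0.0, 1.0) for _ in q]
    q_new, p_new = leapfrog(q, p0, grad_U, step, n_steps)
    h_old = U(q) + 0.5 * sum(x * x for x in p0)
    h_new = U(q_new) + 0.5 * sum(x * x for x in p_new)
    # Metropolis correction for the integrator's discretisation error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return list(q)


# Toy target (an assumption): standard normal, U(q) = 0.5 * |q|^2.
def U(q):
    return 0.5 * sum(x * x for x in q)


def grad_U(q):
    return list(q)


rng = random.Random(0)
q = [3.0, -3.0]
samples = []
for _ in range(2000):
    q = hmc_step(q, U, grad_U, 0.2, 10, rng)
    samples.append(q)
```

The cost the abstract refers to is visible even here: each proposal needs `n_steps` gradient evaluations, whereas a random-walk MCMC proposal needs none.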

    Statistical estimation problems in phylogenomics and applications in microbial ecology

    With the growing awareness of the potential for microbial communities to play a role in human health, environmental remediation and other important processes, the challenge of understanding such a complex population through the lens of high-throughput sequencing output has risen to the fore. For a de novo sequenced community, the first step to understanding the population involves comparing the sequences to a reference database in some form. In this dissertation, we consider some challenges and benefits of organizing the reference data according to evolution, with orthologous genes grouped together and stored as a multiple sequence alignment and phylogenetic tree. First, we consider the related problem of estimating the population-level phylogeny of a group of species based on the alignments and phylogenies of several individual genes. Under one common model, species tree estimation is provably statistically consistent by several different methods, but those proofs rely on two separate and potentially shaky assumptions: that every species appears in the data for every gene (i.e., there is no missing data), and that since gene tree estimation is itself consistent, the gene trees used to compute the population-level tree are correct. Second, we explore some novel ways to use a Bayesian MCMC algorithm for jointly estimating alignment and phylogeny. The result is increased accuracy for large alignments, where the MCMC method alone would not be tractable. In the process, we identify a peculiar property of this Bayesian algorithm: it performs much differently on simulated sequences than on sequences from biological alignment benchmarks. No other alignment method tested showed the same divergence. Finally, we present two different practical applications of a reference database containing an alignment and tree for a group of gene families in the context of microbial ecology.
The first is an algorithm that uses the tree and alignment to construct an ensemble of profile hidden Markov models that improves remote homology detection. The second is a data visualization technique that generates an image of the community with a high density of data, yet makes it easy to compare many different samples at a time, potentially uncovering otherwise elusive patterns in the data.

    The Orthology Road: Theory and Methods in Orthology Analysis

    The evolution of biological species depends on changes in genes. Among these changes are the gradual accumulation of DNA mutations, insertions and deletions, duplication of genes, movements of genes within and between chromosomes, gene losses and gene transfer. As two populations of the same species evolve independently, they will eventually become reproductively isolated and become two distinct species. The evolutionary history of a set of related species through the repeated occurrence of this speciation process can be represented as a tree-like structure, called a phylogenetic tree or a species tree. Since duplicated genes in a single species also independently accumulate point mutations, insertions and deletions, they drift apart in composition in the same way as genes in two related species. The divergence of all the genes descended from a single gene in an ancestral species can also be represented as a tree, a gene tree that takes into account both speciation and duplication events. In order to reconstruct the evolutionary history from the study of extant species, we use sets of similar genes, with a relatively high degree of DNA similarity and usually with some functional resemblance, that appear to have been derived from a common ancestor. The degree of similarity among different instances of the “same gene” in different species can be used to explore their evolutionary history via the reconstruction of gene family histories, namely gene trees. Orthology refers specifically to the relationship between two genes that arose by a speciation event, recent or remote, rather than duplication. Comparing orthologous genes is essential to the correct reconstruction of species trees, so detecting and identifying orthologous genes is an important problem, and a longstanding challenge, in comparative and evolutionary genomics as well as phylogenetics. A variety of orthology detection methods have been devised in recent years.
Although many of these methods are dependent on generating gene and/or species trees, it has been shown that orthology can be estimated at acceptable levels of accuracy without having to infer gene trees and/or reconcile gene trees with species trees. Therefore, there is good reason to look at the connection of trees and orthology from a different angle: How much information about the gene tree, the species tree, and their reconciliation is already contained in the orthology relation among genes? Intriguingly, a solution to the first part of this question has already been given by Boecker and Dress [Boecker and Dress, 1998] in a different context. In particular, they completely characterized certain maps which they called symbolic ultrametrics. Semple and Steel [Semple and Steel, 2003] then presented an algorithm that can be used to reconstruct a phylogenetic tree from any given symbolic ultrametric. In this thesis we investigate a new characterization of orthology relations, based on symbolic ultrametrics, for recovering the gene tree. According to Fitch’s definition [Fitch, 2000], two genes are (co-)orthologous if their last common ancestor in the gene tree represents a speciation event. On the other hand, when their last common ancestor is a duplication event, the genes are paralogs. The orthology relation on a set of genes is therefore determined by the gene tree and an “event labeling” that identifies each interior vertex of that tree as either a duplication or a speciation event. In the context of analyzing orthology data, the problem of reconciling event-labeled gene trees with a species tree appears as a variant of the reconciliation problem where gene trees have no labels in their internal vertices. When reconciling a gene tree with a species tree, it can be assumed that the species tree is correct or, in the case of an unknown species tree, it can be inferred. Therefore, it is crucial to know for a given gene tree whether there even exists a species tree.
In this thesis we characterize event-labelled gene trees for which a species tree exists and species trees to which event-labelled gene trees can be mapped. Reconciliation methods are not always the best options for detecting orthology. A fundamental problem is that, aside from multicellular eukaryotes, evolution does not seem to have conformed to the descent-with-modification model that gives rise to tree-like phylogenies. Examples include many cases of prokaryotes and viruses whose evolution involved horizontal gene transfer. To treat the problem of distinguishing orthology and paralogy within a more general framework, graph-based methods have been proposed to detect and differentiate among evolutionary relationships of genes in those organisms. In this work we introduce a measure of orthology that can be used to test graph-based methods and reconciliation methods that detect orthology. Using these results, we devise a new algorithm, BOTTOM-UP, to determine whether a map from the set of vertices of a tree to a set of events is a symbolic ultrametric. Additionally, we present a simulation environment designed to generate large gene families with complex duplication histories, on which reconstruction algorithms can be tested and software tools can be benchmarked.
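
Fitch's definition quoted in this abstract can be illustrated with a small sketch: given an event-labeled gene tree, each pair of genes is classified by the event at its last common ancestor. The toy tree, the nested-tuple encoding, and the function names `leaves` and `relations` are assumptions made for illustration; this is not the thesis's BOTTOM-UP algorithm.

```python
from itertools import combinations

# Hypothetical toy gene tree. An inner node is (event, children);
# a leaf is a gene name given as a string.
tree = ("speciation", [
    ("duplication", ["a1", "a2"]),
    ("speciation", ["b1", "c1"]),
])


def leaves(node):
    """Return the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [gene for child in children for gene in leaves(child)]


def relations(node, out=None):
    """Label each gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, children = node
    child_leaves = [leaves(child) for child in children]
    # Genes in different child subtrees have this node as their
    # last common ancestor.
    for left, right in combinations(child_leaves, 2):
        for x in left:
            for y in right:
                label = "ortholog" if event == "speciation" else "paralog"
                out[frozenset((x, y))] = label
    for child in children:
        relations(child, out)
    return out


rel = relations(tree)
```

Here `a1` and `a2` come out as paralogs because their last common ancestor is a duplication, while every pair that meets at a speciation node is orthologous. The thesis studies the reverse direction: how much of the tree can be recovered when only such a pairwise relation is given.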

    Gene Family Histories: Theory and Algorithms

    Detailed gene family histories and reconciliations with species trees are a prerequisite for studying associations between genetic and phenotypic innovations. Even though the true evolutionary scenarios are usually unknown, they impose certain constraints on the mathematical structure of data obtained from simple yes/no questions in pairwise comparisons of gene sequences. Recent advances in this field have led to the development of methods for reconstructing (aspects of) the scenarios on the basis of such relation data, which can most naturally be represented by graphs on the set of considered genes. We provide here novel characterizations of best match graphs (BMGs), which capture the notion of (reciprocal) best hits based on sequence similarities. BMGs provide the basis for the detection of orthologous genes (genes that diverged after a speciation event). There are two main sources of error in pipelines for orthology inference based on BMGs. Firstly, measurement errors in the estimation of best matches from sequence similarity in general lead to violations of the characteristic properties of BMGs. The second issue concerns the reconstruction of the orthology relation from a BMG. We show how to correct estimated BMGs to mathematically valid ones and how much information about orthologs is contained in BMGs. We then discuss implicit methods for horizontal gene transfer (HGT) inference that focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of an undirected graph, the later-divergence-time (LDT) graph. We explore the mathematical structure of LDT graphs and show how much information about all HGT events is contained in such LDT graphs.

    Unrooted unordered homeomorphic subtree alignment of RNA trees

    We generalize some current approaches for RNA tree alignment, which are traditionally confined to ordered rooted mappings, to also consider unordered unrooted mappings. We define the Homeomorphic Subtree Alignment problem (HSA), and present a new algorithm which applies to several modes, combining global or local, ordered or unordered, and rooted or unrooted tree alignments. Our algorithm generalizes previous algorithms that either solved the problem in an asymmetric manner, or were restricted to the rooted and/or ordered cases. Focusing here on the most general unrooted unordered case, we show that for input trees $T$ and $S$, our algorithm has an $O(n_T n_S + \min(d_T, d_S) L_T L_S)$ time complexity, where $n_T$, $L_T$ and $d_T$ are the number of nodes, the number of leaves, and the maximum node degree in $T$, respectively (satisfying $d_T \le L_T \le n_T$), and similarly for $n_S$, $L_S$ and $d_S$ with respect to the tree $S$. This improves the time complexity of previous algorithms for less general variants of the problem. In order to obtain this time bound for HSA, we developed new algorithms for a generalized variant of the Min-Cost Bipartite Matching problem (MCM), as well as for two derivatives of this problem, entitled All-Cavity-MCM and All-Pairs-Cavity-MCM. For two input sets of size $n$ and $m$, where $n \le m$, MCM and both its cavity derivatives are solved in $O(n^3 + nm)$ time, without the usage of priority queues (e.g. Fibonacci heaps) or other complex data structures.
This gives the first cubic time algorithm for All-Pairs-Cavity-MCM, and improves the running times of the MCM and All-Cavity-MCM problems in the unbalanced case where $n \ll m$. We implemented the algorithm (in all modes mentioned above) as a graphical software tool which computes and displays similarities between secondary structures of RNA given as input, and applied it in a preliminary experiment in which we ran all-against-all inter-family pairwise alignments of RNAse P and Hammerhead RNA family members, exposing new similarities which could not be detected by the traditional rooted ordered alignment approaches. The results demonstrate that our approach can be used to expose structural similarity between some RNAs with higher sensitivity than the traditional rooted ordered alignment approaches. Source code and a web interface for our tool can be found at http://www.cs.bgu.ac.il/~negevcb/FRUUT

    LIPIcs, Volume 248, ISAAC 2022, Complete Volume


    Planare Graphen und ihre Dualgraphen auf Zylinderoberflächen

    In this thesis, we investigate plane drawings of undirected and directed graphs on cylinder surfaces. In the case of undirected graphs, the vertices are positioned on a line that is parallel to the cylinder’s axis and the edge curves must not intersect this line. We show that a plane drawing is possible if and only if the graph is a double-ended queue (deque) graph, i.e., the vertices of the graph can be processed according to a linear order and the edges correspond to items in the deque inserted and removed at their end vertices. A surprising consequence resulting from these observations is that the deque characterizes planar graphs with a Hamiltonian path. This result extends the known characterization of planar graphs with a Hamiltonian cycle by two stacks. By these insights, we also obtain a new characterization of queue graphs and their duals. We also consider the complexity of deciding whether a graph is a deque graph and prove that it is NP-complete. By introducing a split operation, we obtain the splittable deque and show that it characterizes planarity. For the proof, we devise an algorithm that uses the splittable deque to test whether a rotation system is planar. In the case of directed graphs, we study upward plane drawings where the edge curves follow the direction of the cylinder’s axis (standing upward planarity; SUP) or they wind around the axis (rolling upward planarity; RUP). We characterize RUP graphs by means of their duals and show that RUP and SUP swap their roles when considering a graph and its dual. There is a physical interpretation underlying this characterization: A SUP graph is to its RUP dual graph as electric current passing through a conductor is to the magnetic field surrounding the conductor.
Whereas testing whether a graph is RUP is NP-hard in general [Bra14], for directed graphs without sources and sinks, we develop a linear-time recognition algorithm that is based on our dual graph characterization of RUP graphs. The thesis deals with planar drawings of undirected and directed graphs on cylinder surfaces. In the undirected case, drawings are considered in which the vertices are positioned on a line parallel to the cylinder’s axis and the edges must not cross this line. It can be shown that a planar drawing is possible if and only if the edges of the graph can be processed in a double-ended queue (deque). Queue, stack and double stack can be characterized in the same way. A surprising consequence of these findings is that the deque characterizes exactly the planar graphs with a Hamiltonian path. This extends the already known characterization of planar graphs with a Hamiltonian cycle by the double stack. In the directed case, the edge curves must either run in the direction of the cylinder’s axis (SUP graphs) or wind around the axis (RUP graphs). The thesis characterizes RUP graphs and shows that RUP and SUP swap their roles when a graph and its dual graph are considered. The SUP graph relates to the RUP graph as electric current through a conductor relates to the induced magnetic field. Based on this characterization, it is possible to develop a linear-time algorithm that decides whether a directed graph without sources and sinks is a RUP graph, whereas the general case is NP-hard [Bra14].

    29th International Symposium on Algorithms and Computation: ISAAC 2018, December 16-19, 2018, Jiaoxi, Yilan, Taiwan

    LIPIcs, Volume 274, ESA 2023, Complete Volume


    Analysis of the Chickpea genome using next generation sequencing data
