
    Deciding the Closure of Inconsistent Rooted Triples Is NP-Complete

    Interpreting three-leaf binary trees or rooted triples as constraints yields an entailment relation, whereby binary trees satisfying some rooted triples must also satisfy others, and thence a closure operator, which is known to be polynomial-time computable. This is extended to inconsistent triple sets by defining that a triple is entailed by such a set if it is entailed by any consistent subset of it. Determining whether the closure of an inconsistent rooted triple set can be computed in polynomial time was posed as an open problem in the Isaac Newton Institute's "Phylogenetics" program in 2007. It appears (as NC4) in a collection of such open problems maintained by Mike Steel, and it is the last of that collection's five problems concerning computational complexity to have remained open. We resolve the complexity of computing this closure, proving that its decision version is NP-Complete. In the process, we also prove that detecting the existence of any acyclic B-hyperpath (from specified source to destination) is NP-Complete, in a significantly narrower special case than the version whose minimization problem was recently proven NP-hard by Ritz et al. This implies it is NP-hard to approximate (our special case of) their minimization problem to within any factor.
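
To illustrate the polynomial-time case for consistent triple sets mentioned in this abstract, here is a minimal Python sketch. It is not taken from the paper. It assumes the standard BUILD-style recursion of Aho et al. for testing consistency, and it tests entailment over binary trees by checking that both alternative triples on the same three leaves are inconsistent with the set. The names `is_consistent` and `entails`, and the tuple encoding `(a, b, c)` for the triple ab|c, are hypothetical.

```python
from itertools import chain


def is_consistent(triples, leaves=None):
    """BUILD-style test: is there a rooted tree displaying every triple?

    Each triple (a, b, c) encodes ab|c: a and b are closer to each other
    than either is to c.
    """
    triples = list(triples)
    if leaves is None:
        leaves = set(chain.from_iterable(triples))
    if len(leaves) <= 2:
        return True
    # Only triples whose three leaves all lie in the current leaf set matter.
    active = [t for t in triples if set(t) <= leaves]

    # Union-find over the leaves; each active triple ab|c joins a and b.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in active:
        parent[find(a)] = find(b)

    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)

    # A single component means the root cannot split the leaves: inconsistent.
    if len(components) == 1:
        return False
    return all(is_consistent(active, comp) for comp in components.values())


def entails(triples, triple):
    """Entailment over binary trees, assuming `triples` itself is consistent.

    A binary tree displays exactly one triple on {a, b, c}, so ab|c is
    entailed when neither ac|b nor bc|a can be added consistently.
    """
    a, b, c = triple
    base = list(triples)
    return (not is_consistent(base + [(a, c, b)])
            and not is_consistent(base + [(b, c, a)]))
```

For the consistent set {ab|c, bc|d}, `entails` reports that ab|d follows. The NP-complete problem studied in the paper concerns inconsistent sets, where consistent subsets have to be considered; this sketch does not attempt that.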

    The Orthology Road: Theory and Methods in Orthology Analysis

    The evolution of biological species depends on changes in genes. Among these changes are the gradual accumulation of DNA mutations, insertions and deletions, duplication of genes, movements of genes within and between chromosomes, gene losses and gene transfer. As two populations of the same species evolve independently, they will eventually become reproductively isolated and become two distinct species. The evolutionary history of a set of related species through the repeated occurrence of this speciation process can be represented as a tree-like structure, called a phylogenetic tree or a species tree. Since duplicated genes in a single species also independently accumulate point mutations, insertions and deletions, they drift apart in composition in the same way as genes in two related species. The divergence of all the genes descended from a single gene in an ancestral species can also be represented as a tree, a gene tree that takes into account both speciation and duplication events. In order to reconstruct the evolutionary history from the study of extant species, we use sets of similar genes, with relatively high degree of DNA similarity and usually with some functional resemblance, that appear to have been derived from a common ancestor. The degree of similarity among different instances of the “same gene” in different species can be used to explore their evolutionary history via the reconstruction of gene family histories, namely gene trees. Orthology refers specifically to the relationship between two genes that arose by a speciation event, recent or remote, rather than duplication. Comparing orthologous genes is essential to the correct reconstruction of species trees, so that detecting and identifying orthologous genes is an important problem, and a longstanding challenge, in comparative and evolutionary genomics as well as phylogenetics. A variety of orthology detection methods have been devised in recent years. 
Although many of these methods are dependent on generating gene and/or species trees, it has been shown that orthology can be estimated at acceptable levels of accuracy without having to infer gene trees and/or reconciling gene trees with species trees. Therefore, there is good reason to look at the connection of trees and orthology from a different angle: How much information about the gene tree, the species tree, and their reconciliation is already contained in the orthology relation among genes? Intriguingly, a solution to the first part of this question has already been given by Boecker and Dress [Boecker and Dress, 1998] in a different context. In particular, they completely characterized certain maps which they called symbolic ultrametrics. Semple and Steel [Semple and Steel, 2003] then presented an algorithm that can be used to reconstruct a phylogenetic tree from any given symbolic ultrametric. In this thesis we investigate a new characterization of orthology relations, based on symbolic ultrametrics for recovering the gene tree. According to Fitch’s definition [Fitch, 2000], two genes are (co-)orthologous if their last common ancestor in the gene tree represents a speciation event. On the other hand, when their last common ancestor is a duplication event, the genes are paralogs. The orthology relation on a set of genes is therefore determined by the gene tree and an “event labeling” that identifies each interior vertex of that tree as either a duplication or a speciation event. In the context of analyzing orthology data, the problem of reconciling event-labeled gene trees with a species tree appears as a variant of the reconciliation problem where gene trees have no labels in their internal vertices. When reconciling a gene tree with a species tree, it can be assumed that the species tree is correct or, in the case of an unknown species tree, it can be inferred. Therefore it is crucial to know for a given gene tree whether there even exists a species tree. 
In this thesis we characterize event-labelled gene trees for which a species tree exists and species trees to which event-labelled gene trees can be mapped. Reconciliation methods are not always the best options for detecting orthology. A fundamental problem is that, aside from multicellular eukaryotes, evolution does not seem to have conformed to the descent-with-modification model that gives rise to tree-like phylogenies. Examples include many cases of prokaryotes and viruses whose evolution involved horizontal gene transfer. To treat the problem of distinguishing orthology and paralogy within a more general framework, graph-based methods have been proposed to detect and differentiate among evolutionary relationships of genes in those organisms. In this work we introduce a measure of orthology that can be used to test graph-based methods and reconciliation methods that detect orthology. Using these results, we devise a new algorithm, BOTTOM-UP, to determine whether a map from the set of vertices of a tree to a set of events is a symbolic ultrametric. Additionally, we present a simulation environment designed to generate large gene families with complex duplication histories, on which reconstruction algorithms can be tested and software tools can be benchmarked.
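
Fitch's definition quoted in this abstract can be made concrete with a small Python sketch. It is only an illustration and not the thesis's BOTTOM-UP algorithm. The tree encoding (a leaf is a gene name string, an internal vertex is a pair `(event, children)`), the event labels `"speciation"` and `"duplication"`, and the function names are all hypothetical.

```python
from itertools import combinations, product


def leaves(node):
    """Return the gene names below a vertex of the event-labeled gene tree."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [gene for child in children for gene in leaves(child)]


def orthologs(node):
    """Return the gene pairs whose last common ancestor is a speciation.

    Pairs are frozensets so that {x, y} and {y, x} are the same pair.
    """
    if isinstance(node, str):
        return set()
    event, children = node
    pairs = set()
    for child in children:
        pairs |= orthologs(child)
    if event == "speciation":
        # Genes taken from two different child subtrees meet exactly here.
        for left, right in combinations(children, 2):
            for x, y in product(leaves(left), leaves(right)):
                pairs.add(frozenset((x, y)))
    return pairs


# Hypothetical gene tree: a speciation at the root, followed by a
# duplication in one lineage that produces a1 and a2.
tree = ("speciation", [("duplication", ["a1", "a2"]), "b1"])
relation = orthologs(tree)
```

In this toy tree, `a1` and `a2` are paralogs because they meet at the duplication vertex, while each of them is orthologous to `b1`.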

    A Survey of Combinatorial Methods for Phylogenetic Networks

    The evolutionary history of a set of species is usually described by a rooted phylogenetic tree. Although it is generally undisputed that bifurcating speciation events and descent with modifications are major forces of evolution, there is a growing belief that reticulate events also have a role to play. Phylogenetic networks provide an alternative to phylogenetic trees and may be more suitable for data sets where evolution involves significant amounts of reticulate events, such as hybridization, horizontal gene transfer, or recombination. In this article, we give an introduction to the topic of phylogenetic networks, very briefly describing the fundamental concepts and summarizing some of the most important combinatorial methods that are available for their computation

    Four Lessons in Versatility or How Query Languages Adapt to the Web

    Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled a new kind of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: Some providers choose XML, others RDF, again others JSON or OWL, for their data, even in similar domains. This fracturing stifles innovation as application builders have to cope not only with one Web stack (e.g., XML technology) but with several ones, each of considerable complexity. With Xcerpt we have developed a rule- and pattern-based query language that aims to shield application builders from much of this complexity: In a single query language XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply for querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards a more convenient, yet highly efficient data access in a “Web of Data”.

    Implementation of Web Query Languages Reconsidered

    Visions of the next generation Web such as the "Semantic Web" or the "Web 2.0" have triggered the emergence of a multitude of data formats. These formats have different characteristics as far as the shape of data is concerned (for example tree- vs. graph-shaped). They are accompanied by a puzzlingly large number of query languages each limited to one data format. Thus, a key feature of the Web, namely to make it possible to access anything published by anyone, is compromised. This thesis is devoted to versatile query languages capable of accessing data in a variety of Web formats. The issue is addressed from three angles: language design, common, yet uniform semantics, and common, yet uniform evaluation. Thus it is divided in three parts: First, we consider the query language Xcerpt as an example of the advocated class of versatile Web query languages. Using this concrete exemplar allows us to clarify and discuss the vision of versatility in detail. Second, a number of query languages, XPath, XQuery, SPARQL, and Xcerpt, are translated into a common intermediary language, CIQLog. This language has a purely logical semantics, which makes it easily amenable to optimizations. As a side effect, this provides what is, to the best of our knowledge, the first logical semantics for XQuery and SPARQL. It is a very useful tool for understanding the commonalities and differences of the considered languages. Third, the intermediate logical language is translated into a query algebra, CIQCAG. The core feature of CIQCAG is that it scales from tree- to graph-shaped data and queries without efficiency losses when tree-data and -queries are considered: it is shown that, in these cases, optimal complexities are achieved. CIQCAG is also shown to evaluate each of the aforementioned query languages with a complexity at least as good as the best known evaluation methods so far. 
For example, navigational XPath is evaluated with space complexity O(q d) and time complexity O(q n) where q is the query size, n the data size, and d the depth of the (tree-shaped) data. CIQCAG is further shown to provide linear time and space evaluation of tree-shaped queries for a larger class of graph-shaped data than any method previously proposed. This larger class of graph-shaped data, called continuous-image graphs, short CIGs, is introduced for the first time in this thesis. A (directed) graph is a CIG if its nodes can be totally ordered in such a manner that, for this order, the children of any node form a continuous interval. CIQCAG achieves these properties by employing a novel data structure, called sequence map, that allows an efficient evaluation of tree-shaped queries, or of tree-shaped cores of graph-shaped queries on any graph-shaped data. While being ideally suited to trees and CIGs, the data structure gracefully degrades to unrestricted graphs. It yields a remarkably efficient evaluation on graph-shaped data that only a few edges prevent from being trees or CIGs
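
The definition of a continuous-image graph given in this abstract can be illustrated with a short Python sketch. It only verifies the interval condition for a candidate node order that is supplied by the caller; it does not search for such an order and is unrelated to the thesis's sequence map data structure. The function name and the adjacency-dictionary encoding are hypothetical.

```python
def is_continuous_image_order(children, order):
    """Check that, under `order`, each node's children form one interval.

    children: dict mapping a node to the collection of its child nodes
    order:    list containing every node exactly once (the candidate order)
    """
    position = {node: index for index, node in enumerate(order)}
    for kids in children.values():
        indices = sorted({position[kid] for kid in kids})
        # A gap-free run of k distinct positions spans exactly k slots.
        if indices and indices[-1] - indices[0] + 1 != len(indices):
            return False
    return True


# A tree listed in breadth-first order satisfies the condition.
tree = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
tree_ok = is_continuous_image_order(tree, ["r", "a", "b", "c", "d", "e"])

# A graph with shared children: the result depends on the chosen order.
graph = {"x": ["a", "c"], "y": ["a", "b"]}
bad_order = is_continuous_image_order(graph, ["x", "y", "a", "b", "c"])
good_order = is_continuous_image_order(graph, ["x", "y", "b", "a", "c"])
```

The second example shows why the definition quantifies over orders: the same graph fails the check for one order and passes it for another.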

    Sur l'analyse statique des requêtes SPARQL avec la logique modale

    Static analysis is a core task in query optimization and knowledge base verification. We study static analysis techniques for SPARQL, the standard language for querying Semantic Web data. Specifically, we investigate the query containment problem and the query-update independence analysis. We are interested in developing techniques through reductions to the validity problem in logic. We address SPARQL query containment with optional matching. We focus on the class of well-designed SPARQL queries, proposed in the literature as a fragment of the language with good properties regarding query evaluation. SPARQL is interpreted over graphs, hence we encode it in a graph logic, specifically the modal logic K interpreted over labeled transition systems. We show that this logic is powerful enough to deal with query containment for the well-designed fragment of SPARQL. We show how to translate RDF graphs into transition systems and SPARQL queries into K-formulae. Therefore, query containment in SPARQL can be reduced to unsatisfiability in K. We also report on a preliminary overview of the SPARQL query-update problem. A query is independent of an update when the execution of the update does not affect the result of the query. Determining independence is especially useful in the context of huge RDF repositories, where it makes it possible to avoid expensive yet useless re-evaluation of queries. While this problem has been intensively studied for fragments of relational calculus, no works exist for the standard query language for the semantic web. We report on our investigations on how a notion of independence can be defined in the SPARQL context.
Static analysis is an essential task in query optimization and in the verification of RDF graph bases. We study static analysis techniques for SPARQL, the standard language for querying Semantic Web data. More precisely, we study the query containment problem and the analysis of independence between queries and updates of the RDF graph base. We are interested in developing techniques through reductions to the satisfiability problem in logic. We address the containment problem for SPARQL queries in the presence of the OPTIONAL operator. Optionality is one of the most complicated constructs in SPARQL and also the one that makes this language more expressive than classical query languages such as SQL. We focus on the class of queries called "well-designed SPARQL", proposed in the literature as a fragment of the language with good properties for evaluating queries that include the OPTIONAL operation. To date, query containment has been tested using different techniques: graph homomorphism, canonical databases, automata-theoretic techniques, and reduction to the validity problem of a logic. In this thesis, we use the last technique to test containment of SPARQL queries with OPTIONAL, using an expressive logic called "logic K". With this technique, it is possible to settle the query containment problem for several fragments of SPARQL, even in the presence of schemas. This extensibility is not guaranteed by the other methods. We show how to translate an RDF graph into a transition system and a SPARQL query into a K-formula. With these translations, query containment in SPARQL can be reduced to testing the validity of a logical formula. An advantage of this approach is that it opens the way to implementations using satisfiability solvers for K. We present a benchmark of containment tests for SPARQL queries with OPTIONAL. We carried out experiments to test and compare state-of-the-art containment solvers. We also present a preliminary overview of the query-update independence problem. A query is independent of an update when executing the update does not change the result of the query. Although this problem has been intensively studied for fragments of relational calculus, no work exists for the standard query language of the Semantic Web. We propose a definition of the notion of independence in the SPARQL context and establish first directions for static analysis in certain situations of containment between a query and an update.

    Gene Family Histories: Theory and Algorithms

    Detailed gene family histories and reconciliations with species trees are a prerequisite for studying associations between genetic and phenotypic innovations. Even though the true evolutionary scenarios are usually unknown, they impose certain constraints on the mathematical structure of data obtained from simple yes/no questions in pairwise comparisons of gene sequences. Recent advances in this field have led to the development of methods for reconstructing (aspects of) the scenarios on the basis of such relation data, which can most naturally be represented by graphs on the set of considered genes. We provide here novel characterizations of best match graphs (BMGs) which capture the notion of (reciprocal) best hits based on sequence similarities. BMGs provide the basis for the detection of orthologous genes (genes that diverged after a speciation event). There are two main sources of error in pipelines for orthology inference based on BMGs. Firstly, measurement errors in the estimation of best matches from sequence similarity in general lead to violations of the characteristic properties of BMGs. The second issue concerns the reconstruction of the orthology relation from a BMG. We show how to correct estimated BMGs to mathematically valid ones and how much information about orthologs is contained in BMGs. We then discuss implicit methods for horizontal gene transfer (HGT) inference that focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of an undirected graph, the later-divergence-time (LDT) graph. We explore the mathematical structure of LDT graphs and show how much information about all HGT events is contained in such LDT graphs.
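
The notion of (reciprocal) best hits mentioned in this abstract can be sketched in Python as follows. This is an illustrative estimate from pairwise similarity scores, under the assumption that a best match of gene x in another species is any gene of that species with maximal similarity to x. It does not implement the characterizations or the correction methods of the thesis. The function names, the score dictionary, and the toy data are hypothetical.

```python
def best_matches(similarity, species):
    """Return directed edges (x, y): y is a best match of x in y's species.

    similarity: dict mapping ordered gene pairs (x, y) to a score
    species:    dict mapping each gene to the species it resides in
    """
    best = {}  # (gene, target species) -> (top score, genes reaching it)
    for (x, y), score in similarity.items():
        if species[x] == species[y]:
            continue  # best matches are sought in other species only
        key = (x, species[y])
        top = best.get(key)
        if top is None or score > top[0]:
            best[key] = (score, {y})
        elif score == top[0]:
            top[1].add(y)  # ties are kept as co-best matches
    return {(x, y) for (x, _), (_, ys) in best.items() for y in ys}


def reciprocal_best_hits(similarity, species):
    """Return the unordered pairs that are best matches in both directions."""
    edges = best_matches(similarity, species)
    return {frozenset((x, y)) for x, y in edges if (y, x) in edges}


# Hypothetical data: two genes in species A, one gene in species B.
species = {"a1": "A", "a2": "A", "b1": "B"}
similarity = {
    ("a1", "b1"): 0.9, ("b1", "a1"): 0.9,
    ("a2", "b1"): 0.5, ("b1", "a2"): 0.5,
}
edges = best_matches(similarity, species)
hits = reciprocal_best_hits(similarity, species)
```

Here `a2` has `b1` as its best match, but the relation is not reciprocal, so only the pair of `a1` and `b1` is reported as a reciprocal best hit.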

    Automatic & Semi-Automatic Methods for Supporting Ontology Change
