34,154 research outputs found

    Efficient Genomic Interval Queries Using Augmented Range Trees

    Full text link
    Efficient large-scale annotation of genomic intervals is essential for personal genome interpretation in the realm of precision medicine. There are 13 possible relations between two intervals according to Allen's interval algebra. Conventional interval trees are routinely used to identify the genomic intervals satisfying a coarse relation with a query interval, but cannot support efficient query for more refined relations such as all Allen's relations. We design and implement a novel approach to address this unmet need. Through rewriting Allen's interval relations, we transform an interval query to a range query, then adapt and utilize the range trees for querying. We implement two types of range trees: a basic 2-dimensional range tree (2D-RT) and an augmented range tree with fractional cascading (RTFC) and compare them with the conventional interval tree (IT). Theoretical analysis shows that RTFC can achieve the best time complexity for interval queries regarding all Allen's relations among the three trees. We also perform comparative experiments on the efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a large collection of personal genomes. Our experimental results show that 2D-RT is more efficient than IT for interval queries regarding most of Allen's relations, RTFC is even more efficient than 2D-RT. The results demonstrate that RTFC is an efficient data structure for querying large-scale datasets regarding Allen's relations between genomic intervals, such as those required by interpreting genome-wide variation in large populations.Comment: 4 figures, 4 table

    Pebbling and Branching Programs Solving the Tree Evaluation Problem

    Full text link
    We study restricted computation models related to the Tree Evaluation Problem}. The TEP was introduced in earlier work as a simple candidate for the (*very*) long term goal of separating L and LogDCFL. The input to the problem is a rooted, balanced binary tree of height h, whose internal nodes are labeled with binary functions on [k] = {1,...,k} (each given simply as a list of k^2 elements of [k]), and whose leaves are labeled with elements of [k]. Each node obtains a value in [k] equal to its binary function applied to the values of its children, and the output is the value of the root. The first restricted computation model, called Fractional Pebbling, is a generalization of the black/white pebbling game on graphs, and arises in a natural way from the search for good upper bounds on the size of nondeterministic branching programs (BPs) solving the TEP - for any fixed h, if the binary tree of height h has fractional pebbling cost at most p, then there are nondeterministic BPs of size O(k^p) solving the height h TEP. We prove a lower bound on the fractional pebbling cost of d-ary trees that is tight to within an additive constant for each fixed d. The second restricted computation model we study is a semantic restriction on (non)deterministic BPs solving the TEP - Thrifty BPs. Deterministic (resp. nondeterministic) thrifty BPs suffice to implement the best known algorithms for the TEP, based on black (resp. fractional) pebbling. In earlier work, for each fixed h a lower bound on the size of deterministic thrifty BPs was proved that is tight for sufficiently large k. We give an alternative proof that achieves the same bound for all k. We show the same bound still holds in a less-restricted model, and also that gradually weaker lower bounds can be obtained for gradually weaker restrictions on the model.Comment: Written as one of the requirements for my MSc. 29 pages, 6 figure

    Math Search for the Masses: Multimodal Search Interfaces and Appearance-Based Retrieval

    Full text link
    We summarize math search engines and search interfaces produced by the Document and Pattern Recognition Lab in recent years, and in particular the min math search interface and the Tangent search engine. Source code for both systems are publicly available. "The Masses" refers to our emphasis on creating systems for mathematical non-experts, who may be looking to define unfamiliar notation, or browse documents based on the visual appearance of formulae rather than their mathematical semantics.Comment: Paper for Invited Talk at 2015 Conference on Intelligent Computer Mathematics (July, Washington DC

    Web Queries: From a Web of Data to a Semantic Web?

    Get PDF

    Datalog with Negation and Monotonicity

    Get PDF
    Positive Datalog has several nice properties that are lost when the language is extended with negation. One example is that fixpoints of positive Datalog programs are robust w.r.t. the order in which facts are inserted, which facilitates efficient evaluation of such programs in distributed environments. A natural question to ask, given a (stratified) Datalog program with negation, is whether an equivalent positive Datalog program exists. In this context, it is known that positive Datalog can express only a strict subset of the monotone queries, yet the exact relationship between the positive and monotone fragments of semi-positive and stratified Datalog was previously left open. In this paper, we complete the picture by showing that monotone queries expressible in semi-positive Datalog exist which are not expressible in positive Datalog. To provide additional insight into this gap, we also characterize a large class of semi-positive Datalog programs for which the dichotomy `monotone if and only if rewritable to positive Datalog\u27 holds. Finally, we give best-effort techniques to reduce the amount of negation that is exhibited by a program, even if the program is not monotone

    The parameterized space complexity of model-checking bounded variable first-order logic

    Get PDF
    The parameterized model-checking problem for a class of first-order sentences (queries) asks to decide whether a given sentence from the class holds true in a given relational structure (database); the parameter is the length of the sentence. We study the parameterized space complexity of the model-checking problem for queries with a bounded number of variables. For each bound on the quantifier alternation rank the problem becomes complete for the corresponding level of what we call the tree hierarchy, a hierarchy of parameterized complexity classes defined via space bounded alternating machines between parameterized logarithmic space and fixed-parameter tractable time. We observe that a parameterized logarithmic space model-checker for existential bounded variable queries would allow to improve Savitch's classical simulation of nondeterministic logarithmic space in deterministic space O(log2n)O(\log^2n). Further, we define a highly space efficient model-checker for queries with a bounded number of variables and bounded quantifier alternation rank. We study its optimality under the assumption that Savitch's Theorem is optimal

    Query Containment for Highly Expressive Datalog Fragments

    Get PDF
    The containment problem of Datalog queries is well known to be undecidable. There are, however, several Datalog fragments for which containment is known to be decidable, most notably monadic Datalog and several "regular" query languages on graphs. Monadically Defined Queries (MQs) have been introduced recently as a joint generalization of these query languages. In this paper, we study a wide range of Datalog fragments with decidable query containment and determine exact complexity results for this problem. We generalize MQs to (Frontier-)Guarded Queries (GQs), and show that the containment problem is 3ExpTime-complete in either case, even if we allow arbitrary Datalog in the sub-query. If we focus on graph query languages, i.e., fragments of linear Datalog, then this complexity is reduced to 2ExpSpace. We also consider nested queries, which gain further expressivity by using predicates that are defined by inner queries. We show that nesting leads to an exponentially increasing hierarchy for the complexity of query containment, both in the linear and in the general case. Our results settle open problems for (nested) MQs, and they paint a comprehensive picture of the state of the art in Datalog query containment.Comment: 20 page

    Reify Your Collection Queries for Modularity and Speed!

    Full text link
    Modularity and efficiency are often contradicting requirements, such that programers have to trade one for the other. We analyze this dilemma in the context of programs operating on collections. Performance-critical code using collections need often to be hand-optimized, leading to non-modular, brittle, and redundant code. In principle, this dilemma could be avoided by automatic collection-specific optimizations, such as fusion of collection traversals, usage of indexing, or reordering of filters. Unfortunately, it is not obvious how to encode such optimizations in terms of ordinary collection APIs, because the program operating on the collections is not reified and hence cannot be analyzed. We propose SQuOpt, the Scala Query Optimizer--a deep embedding of the Scala collections API that allows such analyses and optimizations to be defined and executed within Scala, without relying on external tools or compiler extensions. SQuOpt provides the same "look and feel" (syntax and static typing guarantees) as the standard collections API. We evaluate SQuOpt by re-implementing several code analyses of the Findbugs tool using SQuOpt, show average speedups of 12x with a maximum of 12800x and hence demonstrate that SQuOpt can reconcile modularity and efficiency in real-world applications.Comment: 20 page
    corecore