Efficient Genomic Interval Queries Using Augmented Range Trees
Efficient large-scale annotation of genomic intervals is essential for
personal genome interpretation in the realm of precision medicine. There are 13
possible relations between two intervals according to Allen's interval algebra.
Conventional interval trees are routinely used to identify the genomic
intervals satisfying a coarse relation with a query interval, but cannot
support efficient query for more refined relations such as all Allen's
relations. We design and implement a novel approach to address this unmet need.
Through rewriting Allen's interval relations, we transform an interval query to
a range query, then adapt and utilize the range trees for querying. We
implement two types of range trees: a basic 2-dimensional range tree (2D-RT)
and an augmented range tree with fractional cascading (RTFC) and compare them
with the conventional interval tree (IT). Theoretical analysis shows that RTFC
can achieve the best time complexity for interval queries regarding all Allen's
relations among the three trees. We also perform comparative experiments on the
efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a
large collection of personal genomes. Our experimental results show that 2D-RT
is more efficient than IT for interval queries regarding most of Allen's
relations, and RTFC is even more efficient than 2D-RT. The results demonstrate that
RTFC is an efficient data structure for querying large-scale datasets regarding
Allen's relations between genomic intervals, such as those required for
interpreting genome-wide variation in large populations.
Comment: 4 figures, 4 tables
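The rewriting step described in this abstract can be illustrated with a minimal sketch. It assumes each interval [s, e] with s < e is treated as a 2-D point (s, e), so that a relation to a query interval [qs, qe] becomes a rectangle over that point. The names `rectangle`, `in_range` and `query` are hypothetical, only four of the 13 relations are covered, and a plain linear scan stands in for the range tree, so this shows the transformation only, not the authors' data structure or its complexity.

```python
import math

def rectangle(relation, qs, qe):
    """Bounds on (start, end) for 'x RELATION q', each as (low, high, low_incl, high_incl)."""
    inf = math.inf
    if relation == "during":     # qs < s and e < qe
        return (qs, inf, False, False), (-inf, qe, False, False)
    if relation == "overlaps":   # s < qs < e < qe
        return (-inf, qs, False, False), (qs, qe, False, False)
    if relation == "meets":      # e == qs (s < qs follows from s < e)
        return (-inf, qs, False, False), (qs, qs, True, True)
    if relation == "equals":     # s == qs and e == qe
        return (qs, qs, True, True), (qe, qe, True, True)
    raise ValueError(f"relation not covered in this sketch: {relation}")

def in_range(value, bound):
    """Check one coordinate against an open or closed bound."""
    low, high, low_incl, high_incl = bound
    above = value >= low if low_incl else value > low
    below = value <= high if high_incl else value < high
    return above and below

def query(intervals, relation, qs, qe):
    """Linear-scan stand-in for a 2-D range query over the points (s, e)."""
    s_bound, e_bound = rectangle(relation, qs, qe)
    return [(s, e) for s, e in intervals
            if in_range(s, s_bound) and in_range(e, e_bound)]

intervals = [(1, 3), (2, 6), (4, 5), (3, 8), (0, 3), (3, 7)]
during = query(intervals, "during", 3, 7)
overlaps = query(intervals, "overlaps", 3, 7)
```

Replacing the scan in `query` with a 2-D range tree lookup over the same rectangle is where the efficiency gains discussed in the abstract would come from.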
Pebbling and Branching Programs Solving the Tree Evaluation Problem
We study restricted computation models related to the Tree Evaluation
Problem (TEP). The TEP was introduced in earlier work as a simple candidate for the
(*very*) long term goal of separating L and LogDCFL. The input to the problem
is a rooted, balanced binary tree of height h, whose internal nodes are labeled
with binary functions on [k] = {1,...,k} (each given simply as a list of k^2
elements of [k]), and whose leaves are labeled with elements of [k]. Each node
obtains a value in [k] equal to its binary function applied to the values of
its children, and the output is the value of the root. The first restricted
computation model, called Fractional Pebbling, is a generalization of the
black/white pebbling game on graphs, and arises in a natural way from the
search for good upper bounds on the size of nondeterministic branching programs
(BPs) solving the TEP - for any fixed h, if the binary tree of height h has
fractional pebbling cost at most p, then there are nondeterministic BPs of size
O(k^p) solving the height h TEP. We prove a lower bound on the fractional
pebbling cost of d-ary trees that is tight to within an additive constant for
each fixed d. The second restricted computation model we study is a semantic
restriction on (non)deterministic BPs solving the TEP - Thrifty BPs.
Deterministic (resp. nondeterministic) thrifty BPs suffice to implement the
best known algorithms for the TEP, based on black (resp. fractional) pebbling.
In earlier work, for each fixed h a lower bound on the size of deterministic
thrifty BPs was proved that is tight for sufficiently large k. We give an
alternative proof that achieves the same bound for all k. We show the same
bound still holds in a less-restricted model, and also that gradually weaker
lower bounds can be obtained for gradually weaker restrictions on the model.
Comment: Written as one of the requirements for my MSc. 29 pages, 6 figures
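The problem statement in this abstract can be made concrete with a small sketch of a direct recursive evaluator. The heap-style node numbering (node i has children 2i and 2i+1), the helper `table`, and the example functions are assumptions made for illustration. This is plain recursion, not a branching program or a pebbling strategy.

```python
def table(f, k):
    """Flatten a binary function on [k] = {1,...,k} into a list of k^2 elements of [k]."""
    return [f(a, b) for a in range(1, k + 1) for b in range(1, k + 1)]

def evaluate(node, k, functions, leaves):
    """Value of `node`, assuming heap numbering: children of i are 2i and 2i+1."""
    if node in leaves:
        return leaves[node]
    left = evaluate(2 * node, k, functions, leaves)
    right = evaluate(2 * node + 1, k, functions, leaves)
    # Row index is the left child's value, column index the right child's value.
    return functions[node][(left - 1) * k + (right - 1)]

k = 3
# A balanced binary tree with 7 nodes: internal nodes 1..3, leaves 4..7.
leaves = {4: 1, 5: 2, 6: 3, 7: 1}
functions = {
    1: table(lambda a, b: (a * b) % k + 1, k),
    2: table(max, k),
    3: table(min, k),
}
root_value = evaluate(1, k, functions, leaves)
```

Here node 2 evaluates to max(1, 2) = 2, node 3 to min(3, 1) = 1, and the root to (2 * 1) mod 3 + 1 = 3.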
Math Search for the Masses: Multimodal Search Interfaces and Appearance-Based Retrieval
We summarize math search engines and search interfaces produced by the
Document and Pattern Recognition Lab in recent years, and in particular the min
math search interface and the Tangent search engine. Source code for both
systems is publicly available. "The Masses" refers to our emphasis on creating
systems for mathematical non-experts, who may be looking to define unfamiliar
notation, or browse documents based on the visual appearance of formulae rather
than their mathematical semantics.
Comment: Paper for Invited Talk at 2015 Conference on Intelligent Computer
Mathematics (July, Washington DC)
Datalog with Negation and Monotonicity
Positive Datalog has several nice properties that are lost when the language is extended with negation. One example is that fixpoints of positive Datalog programs are robust w.r.t. the order in which facts are inserted, which facilitates efficient evaluation of such programs in distributed environments. A natural question to ask, given a (stratified) Datalog program with negation, is whether an equivalent positive Datalog program exists.
In this context, it is known that positive Datalog can express only a strict subset of the monotone queries, yet the exact relationship between the positive and monotone fragments of semi-positive and stratified Datalog was previously left open. In this paper, we complete the picture by showing that monotone queries expressible in semi-positive Datalog exist which are not expressible in positive Datalog. To provide additional insight into this gap, we also characterize a large class of semi-positive Datalog programs for which the dichotomy 'monotone if and only if rewritable to positive Datalog' holds. Finally, we give best-effort techniques to reduce the amount of negation that is exhibited by a program, even if the program is not monotone.
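The robustness property mentioned at the start of this abstract can be illustrated with a minimal sketch. It assumes a two-rule positive Datalog program for graph reachability, evaluated naively as facts arrive one at a time. The function `closure` and the example facts are hypothetical and are not taken from the paper.

```python
from itertools import permutations

def closure(edge_facts):
    """Least fixpoint of the positive program
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    with edge facts inserted one at a time in the given order."""
    edges, path = set(), set()
    for fact in edge_facts:
        edges.add(fact)
        changed = True
        while changed:  # re-apply both rules until nothing new is derived
            derived = set(edges) | {(x, z) for (x, y) in path
                                    for (y2, z) in edges if y == y2}
            changed = not derived <= path
            path |= derived
    return path

facts = [(1, 2), (2, 3), (3, 4)]
results = {frozenset(closure(order)) for order in permutations(facts)}
```

Every insertion order reaches the same fixpoint, so `results` contains a single set. With negation in rule bodies, a fact derived early could be invalidated by a later insertion, which is why this simple incremental scheme is specific to the positive case.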
The parameterized space complexity of model-checking bounded variable first-order logic
The parameterized model-checking problem for a class of first-order sentences
(queries) asks to decide whether a given sentence from the class holds true in
a given relational structure (database); the parameter is the length of the
sentence. We study the parameterized space complexity of the model-checking
problem for queries with a bounded number of variables. For each bound on the
quantifier alternation rank the problem becomes complete for the corresponding
level of what we call the tree hierarchy, a hierarchy of parameterized
complexity classes defined via space bounded alternating machines between
parameterized logarithmic space and fixed-parameter tractable time. We observe
that a parameterized logarithmic space model-checker for existential bounded
variable queries would allow one to improve Savitch's classical simulation of
nondeterministic logarithmic space in deterministic space $O(\log^2 n)$.
Further, we define a highly space efficient model-checker for queries with a
bounded number of variables and bounded quantifier alternation rank. We study
its optimality under the assumption that Savitch's Theorem is optimal.
Query Containment for Highly Expressive Datalog Fragments
The containment problem of Datalog queries is well known to be undecidable.
There are, however, several Datalog fragments for which containment is known to
be decidable, most notably monadic Datalog and several "regular" query
languages on graphs. Monadically Defined Queries (MQs) have been introduced
recently as a joint generalization of these query languages. In this paper, we
study a wide range of Datalog fragments with decidable query containment and
determine exact complexity results for this problem. We generalize MQs to
(Frontier-)Guarded Queries (GQs), and show that the containment problem is
3ExpTime-complete in either case, even if we allow arbitrary Datalog in the
sub-query. If we focus on graph query languages, i.e., fragments of linear
Datalog, then this complexity is reduced to 2ExpSpace. We also consider nested
queries, which gain further expressivity by using predicates that are defined
by inner queries. We show that nesting leads to an exponentially increasing
hierarchy for the complexity of query containment, both in the linear and in
the general case. Our results settle open problems for (nested) MQs, and they
paint a comprehensive picture of the state of the art in Datalog query
containment.
Comment: 20 pages
Reify Your Collection Queries for Modularity and Speed!
Modularity and efficiency are often contradicting requirements, such that
programmers have to trade one for the other. We analyze this dilemma in the
context of programs operating on collections. Performance-critical code using
collections often needs to be hand-optimized, leading to non-modular, brittle,
and redundant code. In principle, this dilemma could be avoided by automatic
collection-specific optimizations, such as fusion of collection traversals,
usage of indexing, or reordering of filters. Unfortunately, it is not obvious
how to encode such optimizations in terms of ordinary collection APIs, because
the program operating on the collections is not reified and hence cannot be
analyzed.
We propose SQuOpt, the Scala Query Optimizer--a deep embedding of the Scala
collections API that allows such analyses and optimizations to be defined and
executed within Scala, without relying on external tools or compiler
extensions. SQuOpt provides the same "look and feel" (syntax and static typing
guarantees) as the standard collections API. We evaluate SQuOpt by
re-implementing several code analyses of the Findbugs tool using SQuOpt, show
average speedups of 12x with a maximum of 12800x and hence demonstrate that
SQuOpt can reconcile modularity and efficiency in real-world applications.
Comment: 20 pages
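The idea of reifying a collection query so that it can be analyzed before it runs can be sketched as follows. This is an illustration in Python rather than Scala, and it does not use or reproduce the SQuOpt API: the node types `Source`, `Filter` and `Map` and the functions `optimize` and `run` are hypothetical. The only optimization shown is fusing adjacent filters into one traversal.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Source:
    items: tuple

@dataclass(frozen=True)
class Filter:
    base: object
    pred: Callable

@dataclass(frozen=True)
class Map:
    base: object
    fn: Callable

def optimize(q):
    """Rewrite Filter(Filter(b, p), r) into one Filter over b."""
    if isinstance(q, Filter):
        base = optimize(q.base)
        if isinstance(base, Filter):
            inner, outer = base.pred, q.pred
            return Filter(base.base, lambda x: inner(x) and outer(x))
        return Filter(base, q.pred)
    if isinstance(q, Map):
        return Map(optimize(q.base), q.fn)
    return q

def run(q):
    """Interpret the query tree; each Filter or Map node is one traversal."""
    if isinstance(q, Source):
        return list(q.items)
    if isinstance(q, Filter):
        return [x for x in run(q.base) if q.pred(x)]
    return [q.fn(x) for x in run(q.base)]

numbers = Source(tuple(range(10)))
evens = Filter(numbers, lambda x: x % 2 == 0)
large_evens = Filter(evens, lambda x: x > 4)
squares = Map(large_evens, lambda x: x * x)

optimized = optimize(squares)
result = run(optimized)
```

Because the query is a data structure rather than a chain of eager calls, the two filters can be written as separate, reusable pieces and still execute as a single pass; `run(squares)` and `run(optimized)` return the same list.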