568 research outputs found
Making Queries Tractable on Big Data with Preprocessing
A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME al-gorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to pro-vide a formal foundation for this approach in terms of com-putational complexity. (1) We propose a set of Î -tractable queries, denoted by Î T0Q, to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natu-ral query classes are Î -tractable and are feasible on big data. (3) We also study a set Î TQ of query classes that can be ef-fectively converted to Î -tractable queries by re-factorizing its data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for Î TQ. (5) We also show that Î T0Q â P unless P = NC, i.e., the set Î T0Q of all Î -tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, Î TQ = P, i.e., all PTIME query classes can be made Î -tractable via proper re-factorizations. This work is a step towards understanding the tractability of queries in the context of big data. 1
Joining Extractions of Regular Expressions
Regular expressions with capture variables, also known as "regex formulas,"
extract relations of spans (interval positions) from text. These relations can
be further manipulated via Relational Algebra as studied in the context of
document spanners, Fagin et al.'s formal framework for information extraction.
We investigate the complexity of querying text by Conjunctive Queries (CQs) and
Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds
(NP-completeness and W[1]-hardness) from the relational world also hold in our
setting; in particular, hardness hits already single-character text! Yet, the
upper bounds from the relational world do not carry over. Unlike the relational
world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source
of hardness is that it may be intractable to instantiate the relation defined
by a regex formula, simply because it has an exponential number of tuples. Yet,
we are able to establish general upper bounds. In particular, UCQs can be
evaluated with polynomial delay, provided that every CQ has a bounded number of
atoms (while unions and projection can be arbitrary). Furthermore, UCQ
evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the
parameter is the size of the UCQ
Joining extractions of regular expressions
Regular expressions with capture variables, also known as âregex formulas,â extract relations of spans (interval positions) from text. These relations can be further manipulated via the relational
Algebra as studied in the context of âdocument spanners,â Fagin et al.âs formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. Such queries have been investigated in prior work on document spanners, but little is known about the (combined) complexity of their evaluation. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text. Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ
Parallelism with limited nondeterminism
Computational complexity theory studies which computational problems can be solved with limited access to resources. The past fifty years have seen a focus on the relationship between intractable problems and efficient algorithms. However, the relationship between inherently sequential problems and highly parallel algorithms has not been as well studied. Are there efficient but inherently sequential problems that admit some relaxed form of highly parallel algorithm? In this dissertation, we develop the theory of structural complexity around this relationship for three common types of computational problems.
Specifically, we show tradeoffs between time, nondeterminism, and parallelizability. By clearly defining the notions and complexity classes that capture our intuition for parallelizable and sequential problems, we create a comprehensive framework for rigorously proving parallelizability and non-parallelizability of computational problems. This framework provides the means to prove whether otherwise tractable problems can be effectively parallelized, a need highlighted by the current growth of multiprocessor systems. The views adopted by this dissertationâalternate approaches to solving sequential problems using approximation, limited nondeterminism, and parameterizationâcan be applied practically throughout computer science
- âŠ