The Complexity of Resilience
One focus area in data management research is to understand how changes in the data can affect the output of a view or standing query. Example applications are explaining query results and propagating updates through views. In this thesis we study the complexity of the Resilience problem, which is the problem of finding the minimum number of tuples that need to be deleted from the database in order to change the result of a query. We will see that resilience is closely related to the well-studied problems of deletion propagation and causal responsibility, and that analyzing its complexity offers important insight for solving those problems as well.
Our contributions include the definition of the concept of triads for conjunctive queries, which is a crucial tool in our analysis, and the characterization of an NP versus P dichotomy for the resilience problem over the class of conjunctive queries without self-joins. Moreover, this result allowed us to show dichotomies for the same class of queries for both the deletion propagation with source side-effects and the causal responsibility problems. We also completely characterize how the presence of functional dependencies can change the complexity of such problems.
The class of conjunctive queries with self-joins is far richer and more complicated than the self-join-free class. We therefore focus on binary queries without variable repetition, which are queries formed from unary and binary relations only, in which each atom contains at most one occurrence of any variable. For this restricted case, we identify three main query structures that determine complexity: chains, permutations, and confluences. Using these we characterize classes of queries for which resilience is NP-complete and some for which it is in P.
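The resilience problem described above can be illustrated with a small brute-force sketch (exponential, for intuition only; the instance and the query q() :- R(x), S(x, y), T(y) are hypothetical examples, not taken from the thesis):

```python
from itertools import combinations

# Toy instance of the self-join-free CQ  q() :- R(x), S(x, y), T(y).
R = {(1,), (2,)}
S = {(1, 10), (2, 10), (2, 20)}
T = {(10,), (20,)}

def query_holds(R, S, T):
    """True iff some x, y satisfy R(x), S(x, y), T(y)."""
    return any((x,) in R and (y,) in T for (x, y) in S)

def resilience(R, S, T):
    """Minimum number of tuple deletions that make the query false
    (brute force over all deletion sets; illustration only)."""
    tuples = [('R', t) for t in R] + [('S', t) for t in S] + [('T', t) for t in T]
    for k in range(len(tuples) + 1):
        for deleted in combinations(tuples, k):
            dset = set(deleted)
            R2 = {t for t in R if ('R', t) not in dset}
            S2 = {t for t in S if ('S', t) not in dset}
            T2 = {t for t in T if ('T', t) not in dset}
            if not query_holds(R2, S2, T2):
                return k
    return 0
```

On this instance the three witnesses share no common tuple, so a single deletion never suffices; deleting both T-tuples does, giving resilience 2.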
A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations
Resilience is one of the key algorithmic problems underlying various forms of
reverse data management (such as view maintenance, deletion propagation, and
various interventions for fairness): What is the minimal number of tuples to
delete from a database in order to remove all answers from a query? A long-open
question is determining those conjunctive queries (CQs) for which this problem
can be solved in guaranteed PTIME. We shed new light on this and the related
problem of causal responsibility by proposing a unified Integer Linear
Programming (ILP) formulation. It is unified in that it can solve both prior
studied restrictions (e.g., self-join-free CQs under set semantics that allow a
PTIME solution) and new cases (e.g., all CQs under set or bag semantics). It is
also unified in that all queries and all instances are treated with the same
approach, and the algorithm is guaranteed to terminate in PTIME for the easy
cases. We prove that, for all easy self-join-free CQs, the Linear Programming
(LP) relaxation of our encoding is identical to the ILP solution and thus
standard ILP solvers are guaranteed to return the solution in PTIME. Our
approach opens up the door to new variants and new fine-grained analysis: 1) It
also works under bag semantics and we give the first dichotomy result for bag
semantics in the problem space. 2) We give a more fine-grained analysis of the
complexity of causal responsibility. 3) We recover easy instances for generally
hard queries, such as instances with read-once provenance and instances that
become easy because of Functional Dependencies in the data. 4) We solve an open
conjecture from PODS 2020. 5) Experiments confirm that our results indeed
predict the asymptotic running times, and that our universal ILP encoding is at
times even faster to solve for the PTIME cases than a prior proposed dedicated
flow algorithm. Comment: 25 pages, 16 figures
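The ILP formulation for resilience can be sketched as a hitting-set style encoding: one 0/1 variable per tuple, one covering constraint per witness (join result), minimizing total deletions. The sketch below enumerates 0/1 assignments in place of a real ILP solver, so the paper's PTIME guarantees do not apply to it; all names and data are hypothetical:

```python
from itertools import product

def resilience_ilp(tuples, witnesses):
    """tuples: list of tuple ids; witnesses: list of sets of tuple ids.
    Variables x_t in {0,1}; constraint: each witness loses >= 1 tuple;
    objective: minimize sum of x_t (solved here by exhaustive enumeration)."""
    best = None
    for assignment in product((0, 1), repeat=len(tuples)):
        x = dict(zip(tuples, assignment))
        # feasibility: every witness contains at least one deleted tuple
        if all(sum(x[t] for t in w) >= 1 for w in witnesses):
            cost = sum(assignment)
            if best is None or cost < best:
                best = cost
    return best

# Witnesses of q() :- R(x), S(x, y), T(y) on a toy instance:
tuples = ['R1', 'R2', 'S11', 'S21', 'S22', 'T1', 'T2']
witnesses = [{'R1', 'S11', 'T1'}, {'R2', 'S21', 'T1'}, {'R2', 'S22', 'T2'}]
```

Replacing the integrality requirement with 0 ≤ x_t ≤ 1 gives the LP relaxation; the paper's result is that for all easy self-join-free CQs the relaxation already attains the integral optimum.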
Computational Complexity And Algorithms For Dirty Data Evaluation And Repairing
In this dissertation, we study the dirty data evaluation and repairing problem in relational databases. Dirty data is usually inconsistent, inaccurate, incomplete, and stale. Existing methods and theories describe consistency using integrity constraints, such as data dependencies. However, integrity constraints are good at detecting inconsistency but not at evaluating its degree, and they cannot guide data repairing. This dissertation first studies the computational complexity of, and algorithms for, database inconsistency evaluation. We define and use the minimum tuple deletion to evaluate database inconsistency. For this minimum tuple deletion problem, we study the relationship between the size of the rule set and the computational complexity. We show that the minimum tuple deletion problem is NP-hard to approximate within 17/16, already for three functional dependencies over four attributes. A near-optimal approximation algorithm for computing the minimum tuple deletion is proposed, with a ratio of 2 − 1/2^r, where r is the number of given functional dependencies. To guide data repairing, this dissertation also investigates a repairing method based on query feedback, and formally studies two decision problems, the functional-dependency-restricted deletion and insertion propagation problems, corresponding to deletion and insertion feedback. A comprehensive analysis of both the combined and the data complexity of these problems is provided, considering different relational operators and feedback types. We identify the intractable and the tractable cases to picture the complexity hierarchy of these problems, and provide efficient algorithms for the tractable cases.
Two improvements are proposed: one improves the upper bound for the tuple deletion problem by computing a minimum vertex cover of the conflict graph, and the other gives a better dichotomy for the deletion and insertion propagation problems in the absence of functional dependencies, considering data, combined, and parameterized complexity respectively.
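The conflict-graph view can be sketched as follows: tuples that jointly violate a functional dependency form an edge, and deleting a vertex cover removes all violations. A maximal matching yields the classic factor-2 approximation for vertex cover; this is a generic sketch of that idea, not the dissertation's exact algorithm, and the data is hypothetical:

```python
def approx_min_deletion(conflict_edges):
    """Greedy maximal matching: both endpoints of every matched edge
    are deleted, giving a vertex cover at most twice the optimum."""
    cover = set()
    for u, v in conflict_edges:
        if u not in cover and v not in cover:
            cover.update((u, v))  # matched edge: take both endpoints
    return cover

# Tuples t1..t4 pairwise conflicting on an FD A -> B:
edges = [('t1', 't2'), ('t1', 't3'), ('t2', 't4')]
```

Here the matching picks the edge (t1, t2), whose endpoints already cover the remaining edges, so the returned deletion set happens to be optimal.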
Consistent Query Answering for Primary Keys on Rooted Tree Queries
We study the data complexity of consistent query answering (CQA) on databases
that may violate the primary key constraints. A repair is a maximal subset of
the database satisfying the primary key constraints. For a Boolean query q, the
problem CERTAINTY(q) takes a database as input, and asks whether or not each
repair satisfies q. The computational complexity of CERTAINTY(q) has been
established whenever q is a self-join-free Boolean conjunctive query, or a (not
necessarily self-join-free) Boolean path query. In this paper, we take one more
step towards a general classification for all Boolean conjunctive queries by
considering the class of rooted tree queries. In particular, we show that for
every rooted tree query q, CERTAINTY(q) is in FO, NL-hard ∩ LFP, or
coNP-complete, and it is decidable (in polynomial time), given q, which of the
three cases applies. We also extend our classification to larger classes of
queries with simple primary keys. Our classification criteria rely on query
homomorphisms and our polynomial-time fixpoint algorithm is based on a novel
use of context-free grammar (CFG). Comment: To appear in PODS'2
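The definitions above can be made concrete with a brute-force sketch that enumerates all repairs, exactly the exponential behavior the classification aims to avoid; the relation, key, and query are hypothetical examples:

```python
from itertools import product

# Relation R(key, value) with primary key on the first attribute.
# Boolean query  q() :- R(x, 'a')  asks whether some tuple has value 'a'.

def certain(database):
    """CERTAINTY(q) by enumeration: True iff every repair
    (one tuple chosen per key group) satisfies q."""
    groups = {}
    for key, value in database:
        groups.setdefault(key, []).append((key, value))
    # each repair picks exactly one tuple from every key group
    for repair in product(*groups.values()):
        if not any(value == 'a' for _, value in repair):
            return False
    return True

db = [(1, 'a'), (1, 'b'), (2, 'a')]
```

On `db`, key 1 is violated, but every repair keeps the tuple (2, 'a'), so q is certain; dropping that tuple makes the repair {(1, 'b')} a counterexample.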
Explain3D: Explaining Disagreements in Disjoint Datasets
Data plays an important role in applications, analytic processes, and many
aspects of human activity. As data grows in size and complexity, we are met
with an imperative need for tools that promote understanding and explanations
over data-related operations. Data management research on explanations has
focused on the assumption that data resides in a single dataset, under one
common schema. But the reality of today's data is that it is frequently
un-integrated, coming from different sources with different schemas. When
different datasets provide different answers to semantically similar questions,
understanding the reasons for the discrepancies is challenging and cannot be
handled by the existing single-dataset solutions.
In this paper, we propose Explain3D, a framework for explaining the
disagreements across disjoint datasets (3D). Explain3D focuses on identifying
the reasons for the differences in the results of two semantically similar
queries operating on two datasets with potentially different schemas. Our
framework leverages the queries to perform a semantic mapping across the
relevant parts of their provenance; discrepancies in this mapping point to
causes of the queries' differences. Exploiting the queries gives Explain3D an
edge over traditional schema matching and record linkage techniques, which are
query-agnostic. Our work makes the following contributions: (1) We formalize
the problem of deriving optimal explanations for the differences of the results
of semantically similar queries over disjoint datasets. (2) We design a 3-stage
framework for solving the optimal explanation problem. (3) We develop a
smart-partitioning optimizer that improves the efficiency of the framework by
orders of magnitude. (4)~We experiment with real-world and synthetic data to
demonstrate that Explain3D can derive precise explanations efficiently.
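The core discrepancy-finding step can be sketched in a few lines (this is not the Explain3D system itself; the schema mapping, datasets, and unit conversion are hypothetical):

```python
# Given a mapping between the schemas of two datasets, surface the answers
# on which two semantically similar queries disagree.

def disagreements(answers_a, answers_b, map_a_to_b):
    """Return answers of A with no counterpart in B, and vice versa."""
    mapped_a = {map_a_to_b(t) for t in answers_a}
    only_a = mapped_a - set(answers_b)
    only_b = set(answers_b) - mapped_a
    return only_a, only_b

# Dataset A reports (city, population_in_thousands); B reports (city, population).
a = {('Springfield', 30), ('Shelbyville', 25)}
b = {('Springfield', 30000), ('Ogdenville', 12000)}
only_a, only_b = disagreements(a, b, lambda t: (t[0], t[1] * 1000))
```

The unmatched answers on each side are exactly the discrepancies whose provenance a framework like Explain3D would then trace to produce explanations.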
Dichotomies in Ontology-Mediated Querying with the Guarded Fragment
We study the complexity of ontology-mediated querying when ontologies are
formulated in the guarded fragment of first-order logic (GF). Our general aim
is to classify the data complexity on the level of ontologies where query
evaluation w.r.t. an ontology O is considered to be in PTime if all (unions of
conjunctive) queries can be evaluated in PTime w.r.t. O and coNP-hard if at
least one query is coNP-hard w.r.t. O. We identify several large and relevant
fragments of GF that enjoy a dichotomy between PTime and coNP, some of them
additionally admitting a form of counting. In fact, almost all ontologies in
the BioPortal repository fall into these fragments or can easily be rewritten
to do so. We then establish a variation of Ladner's Theorem on the existence of
NP-intermediate problems and use this result to show that for other fragments,
there is provably no such dichotomy. Again for other fragments (such as full
GF), establishing a dichotomy implies the Feder-Vardi conjecture on the
complexity of constraint satisfaction problems. We also link these results to
Datalog-rewritability and study the decidability of whether a given ontology
enjoys PTime query evaluation, presenting both positive and negative results.