    Computational Complexity And Algorithms For Dirty Data Evaluation And Repairing

    In this dissertation, we study the problem of evaluating and repairing dirty data in relational databases. Dirty data is typically inconsistent, inaccurate, incomplete, or stale. Existing theories describe consistency using integrity constraints, such as data dependencies; however, integrity constraints are good at detecting inconsistency but cannot measure its degree or guide data repairing. This dissertation first studies the computational complexity of, and algorithms for, database inconsistency evaluation. We define the minimum tuple deletion, the smallest number of tuples whose removal restores consistency, and use it to evaluate database inconsistency. For this minimum tuple deletion problem, we study the relationship between the size of the rule set and the computational complexity. We show that the minimum tuple deletion is NP-hard to approximate within 17/16, even for instances with only three functional dependencies and four involved attributes. We then propose a near-optimal approximation algorithm with a ratio of 2 − 1/2r, where r is the number of given functional dependencies. To guide data repairing, the dissertation also investigates repairing driven by query feedback, formally studying two decision problems, functional-dependency-restricted deletion propagation and insertion propagation, which correspond to deletion and insertion feedback. We provide a comprehensive analysis of both the combined and the data complexity of these problems across different relational operators and feedback types, identify the intractable and tractable cases to chart their complexity hierarchy, and give efficient algorithms for the tractable cases. Two improvements are proposed: the first computes a minimum vertex cover of the conflict graph to improve the upper bound for the tuple deletion problem, and the second gives a sharper dichotomy for the deletion and insertion propagation problems in the absence of functional dependencies, treating data, combined, and parameterized complexity separately.
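
    The conflict-graph connection mentioned above lends itself to a compact illustration. Below is a minimal Python sketch, not the dissertation's algorithm (whose 2 − 1/2r ratio is sharper): it builds the conflict graph induced by functional dependency violations and then applies the textbook maximal-matching 2-approximation for minimum vertex cover, so the returned set of tuples to delete is at most twice the optimum. The helper names (conflict_graph, vertex_cover_2_approx) and the toy FD zip → city are invented for illustration.

    from itertools import combinations

    def conflict_graph(tuples, fds):
        """Conflict graph: one node per tuple, an edge between any two
        tuples that jointly violate some FD (lhs_attrs -> rhs_attrs)."""
        edges = set()
        for (i, t1), (j, t2) in combinations(enumerate(tuples), 2):
            for lhs, rhs in fds:
                if all(t1[a] == t2[a] for a in lhs) and any(t1[a] != t2[a] for a in rhs):
                    edges.add((i, j))
        return edges

    def vertex_cover_2_approx(edges):
        """Maximal-matching 2-approximation of minimum vertex cover: the
        deleted tuples form a repair at most twice the optimal size."""
        cover = set()
        for u, v in edges:
            if u not in cover and v not in cover:
                cover.update((u, v))  # take both endpoints of an uncovered edge
        return cover

    # Toy instance: FD {"zip"} -> {"city"}, violated by rows 0 and 2.
    rows = [
        {"zip": "02139", "city": "Cambridge"},
        {"zip": "10001", "city": "New York"},
        {"zip": "02139", "city": "Boston"},
    ]
    to_delete = vertex_cover_2_approx(conflict_graph(rows, [({"zip"}, {"city"})]))
    print(to_delete)  # {0, 2}; deleting either tuple alone would already suffice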

    Enumeration Complexity of Conjunctive Queries with Functional Dependencies

    We study the complexity of enumerating the answers of Conjunctive Queries (CQs) in the presence of Functional Dependencies (FDs). Our focus is on the ability to list output tuples with a constant delay in between, following a linear-time preprocessing. A known dichotomy classifies the acyclic self-join-free CQs into those that admit such enumeration and those that do not. However, this classification no longer holds in the common case where the database exhibits dependencies among attributes: some queries classified as hard are in fact tractable once dependencies are accounted for. We establish a generalization of the dichotomy to accommodate FDs; hence, our classification determines which combinations of a CQ and a set of FDs admit constant-delay enumeration with linear-time preprocessing. In addition, we generalize a hardness result for cyclic CQs to accommodate a common type of FDs. Further conclusions of our development include a dichotomy for enumeration with linear delay and a dichotomy for CQs with disequalities. Finally, we show that all our results apply to the known class of "cardinality dependencies" that generalize FDs (e.g., by stating an upper bound on the number of genres per movie, or friends per person).
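
    To make the enumeration target concrete, here is a minimal Python sketch of the constant-delay paradigm for one full acyclic CQ, Q(x, y, z) :- R(x, y), S(y, z); it is not the paper's general classification algorithm, and the relations R and S are invented. The linear-time preprocessing builds a hash index and semi-join-reduces R so that the enumeration phase never hits a dead end, which is what guarantees constant delay between consecutive answers.

    from collections import defaultdict

    def preprocess(R, S):
        """Linear-time preprocessing for Q(x, y, z) :- R(x, y), S(y, z)."""
        s_by_y = defaultdict(list)
        for y, z in S:
            s_by_y[y].append(z)
        # Semi-join reduction: keep only R-tuples with at least one
        # S-partner, so enumeration never stalls on a dead end.
        r_live = [(x, y) for x, y in R if y in s_by_y]
        return r_live, s_by_y

    def enumerate_answers(r_live, s_by_y):
        """Constant delay between answers: every retained R-tuple is
        guaranteed to produce at least one output tuple."""
        for x, y in r_live:
            for z in s_by_y[y]:
                yield (x, y, z)

    R = [("a", 1), ("b", 2), ("c", 9)]   # ("c", 9) is dead: no S-partner
    S = [(1, "u"), (1, "v"), (2, "w")]
    for answer in enumerate_answers(*preprocess(R, S)):
        print(answer)   # (a,1,u), (a,1,v), (b,2,w)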


    A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations

    Resilience is one of the key algorithmic problems underlying various forms of reverse data management (such as view maintenance, deletion propagation, and various interventions for fairness): what is the minimal number of tuples to delete from a database in order to remove all answers from a query? A long-open question is determining those conjunctive queries (CQs) for which this problem can be solved in guaranteed PTIME. We shed new light on this and the related problem of causal responsibility by proposing a unified Integer Linear Programming (ILP) formulation. It is unified in that it can solve both previously studied restrictions (e.g., self-join-free CQs under set semantics that allow a PTIME solution) and new cases (e.g., all CQs under set or bag semantics). It is also unified in that all queries and all instances are treated with the same approach, and the algorithm is guaranteed to terminate in PTIME for the easy cases. We prove that, for all easy self-join-free CQs, the Linear Programming (LP) relaxation of our encoding is identical to the ILP solution, and thus standard ILP solvers are guaranteed to return the solution in PTIME. Our approach opens the door to new variants and new fine-grained analysis: 1) it also works under bag semantics, and we give the first dichotomy result for bag semantics in this problem space; 2) we give a more fine-grained analysis of the complexity of causal responsibility; 3) we recover easy instances of generally hard queries, such as instances with read-once provenance and instances that become easy because of functional dependencies in the data; 4) we solve an open conjecture from PODS 2020; 5) experiments confirm that our results indeed predict the asymptotic running times, and that our universal ILP encoding is at times even faster to solve for the PTIME cases than a previously proposed dedicated flow algorithm.
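
    The encoding the abstract describes, one binary deletion variable per tuple and one covering constraint per witness (a combination of tuples that jointly produces an answer), can be sketched directly. The Python fragment below is a hedged illustration, not the paper's implementation: it assumes the PuLP library for ILP solving, materializes witnesses naively, and uses an invented toy query Q :- R(x), S(x, y), T(y) over tiny relations.

    # pip install pulp
    from itertools import product
    from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum, value

    R = [1, 2]
    S = [(1, "a"), (2, "a")]
    T = ["a"]

    # Deleting at least one tuple from every witness removes all answers.
    witnesses = [(("R", r), ("S", s), ("T", t))
                 for r, s, t in product(R, S, T)
                 if r == s[0] and t == s[1]]

    tuples = {tup for w in witnesses for tup in w}
    x = {tup: LpVariable(f"x_{i}", cat=LpBinary)
         for i, tup in enumerate(sorted(tuples, key=str))}

    prob = LpProblem("resilience", LpMinimize)
    prob += lpSum(x.values())                    # minimize number of deletions
    for w in witnesses:                          # hit every witness
        prob += lpSum(x[tup] for tup in w) >= 1
    prob.solve()
    print(int(value(prob.objective)))            # 1: deleting T("a") suffices
    print([tup for tup, var in x.items() if var.value() == 1])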