437 research outputs found

    Efficient Discovery of Ontology Functional Dependencies

    Full text link
    Poor data quality has become a pervasive issue due to the increasing complexity and size of modern datasets. Constraint based data cleaning techniques rely on integrity constraints as a benchmark to identify and correct errors. Data values that do not satisfy the given set of constraints are flagged as dirty, and data updates are made to re-align the data and the constraints. However, many errors often require user input to resolve due to domain expertise defining specific terminology and relationships. For example, in pharmaceuticals, 'Advil' \emph{is-a} brand name for 'ibuprofen' that can be captured in a pharmaceutical ontology. While functional dependencies (FDs) have traditionally been used in existing data cleaning solutions to model syntactic equivalence, they are not able to model broader relationships (e.g., is-a) defined by an ontology. In this paper, we take a first step towards extending the set of data quality constraints used in data cleaning by defining and discovering \emph{Ontology Functional Dependencies} (OFDs). We lay out theoretical and practical foundations for OFDs, including a set of sound and complete axioms, and a linear inference procedure. We then develop effective algorithms for discovering OFDs, and a set of optimizations that efficiently prune the search space. Our experimental evaluation using real data show the scalability and accuracy of our algorithms.Comment: 12 page

    Integrity Constraints Revisited: From Exact to Approximate Implication

    Get PDF
    Integrity constraints such as functional dependencies (FD), and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes "in the limit". Finally, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Our results recover, and sometimes extend, several previously known results about the implication problem: implication of MVDs can be checked by considering only 2-tuple relations, and the implication of differential constraints for frequent item sets can be checked by considering only databases containing a single transaction

    Using concept lattices to mine functional dependencies

    Get PDF
    Concept Lattices have been proved to be a valuable tool to represent the knowlegde in a database. In this paper we show how functional dependencies in databases can be extracted using Concept Lattices, not preprocessing the original database, but providing a new closure operator. We also prove that this method generalizes the previous methods and closure operators that are being used to find association rules in binary databases.Postprint (published version

    Integrity Constraints Revisited: From Exact to Approximate Implication

    Full text link
    Integrity constraints such as functional dependencies (FD), and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes "in the limit". Finally, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Our results recover, and sometimes extend, several previously known results about the implication problem: implication of MVDs can be checked by considering only 2-tuple relations, and the implication of differential constraints for frequent item sets can be checked by considering only databases containing a single transaction

    Fundamentals and applications of order dependencies

    Get PDF
    Business-intelligence queries often involve SQL functions and algebraic expressions. There can be clear semantic relationships between a column's values and the values of a function over that column. A common property is monotonicity: as the column's values ascend, so do the function's values (or the other column's values). This we call an order dependency (OD). Queries can be evaluated more efficiently when the query optimizer uses order dependencies. They can be run even faster when the optimizer can also reason over known ODs to infer new ones. Order dependencies can be declared as integrity constraints, and they can be detected automatically for many types of SQL functions and algebraic expressions. We present optimization techniques using ODs for queries that involve join, order by, group by, partition by, and distinct. Essentially, ODs can further exploit interesting orders to eliminate or simplify potentially expensive sorts in the query plan. We evaluate these techniques over our prototype implementation in IBM® DB2® using the TPC-DS® benchmark schema and some customer inspired queries. Our experimental results demonstrate a significant performance gain. Dependencies have played an important role in database theory. We study the theoretical aspects of order dependencies-and unidirectional order dependencies (UODs), a proper sub-class of ODs-which describe the relationships among lexicographical orderings of sets of tuples. We investigate the inference problem for order dependencies. We establish the following: (i) a sound and complete axiomatization for UODs which is sound for ODs; (ii) a hierarchy of order dependency classes; (iii) a proof of co-NP-completeness of the inference problem for ODs and for the subclass of UODs; (iv) a proof of co-NP-completeness of the inference problem of functional dependencies (FDs) from ODs in general, but demonstrate linear time complexity for the inference of FDs from UODs; (v) a sound and complete elimination procedure for testing logical implication over ODs; and (vi) a sound and complete polynomial inference algorithm for sets of UODs over natural domains

    A PC Chase

    Get PDF
    PC stands for path-conjunctive, the name of a class of queries and dependencies that we define over complex values with dictionaries. This class includes the relational conjunctive queries and embedded dependencies, as well as many interesting examples of complex value and oodb queries and integrity constraints. We show that some important classical results on containment, dependency implication, and chasing extend and generalize to this class

    Fuzzy inequational logic

    Full text link
    We present a logic for reasoning about graded inequalities which generalizes the ordinary inequational logic used in universal algebra. The logic deals with atomic predicate formulas of the form of inequalities between terms and formalizes their semantic entailment and provability in graded setting which allows to draw partially true conclusions from partially true assumptions. We follow the Pavelka approach and define general degrees of semantic entailment and provability using complete residuated lattices as structures of truth degrees. We prove the logic is Pavelka-style complete. Furthermore, we present a logic for reasoning about graded if-then rules which is obtained as particular case of the general result
    • …
    corecore