17,303 research outputs found
Pattern Functional Dependencies for Data Cleaning
Patterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like "1980-" might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) functional dependencies and denial constraints, e.g., a ZIP code uniquely determines a city in the UK, have been widely used in data cleaning. However, a promising, but not yet explored, direction is to combine regex- and IC-based theories to capture data dependencies involving partial attribute values. For example, in an employee ID such as"F-9-107", "F" is sufficient to determine the finance department. Inspired by the above observation, we propose a novel class of ICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, analogous to Armstrong's axioms for FDs, and study the complexity of consistency and implication analysis of PFDs. Moreover, we devise an effective algorithm to automatically discover PFDs even in the presence of errors in the data. Our extensive experiments on 15 real-world datasets show that our approach can effectively discover valid and useful PFDs over dirty data, which can then be used to detect data errors that are hard to capture by other types of ICs
Characterizing approximate-matching dependencies in formal concept analysis with pattern structures
Functional dependencies (FDs) provide valuable knowledge on the relations between attributes of a data table. A functional dependency holds when the values of an attribute can be determined by another. It has been shown that FDs can be expressed in terms of partitions of tuples that are in agreement w.r.t. the values taken by some subsets of attributes. To extend the use of FDs, several generalizations have been proposed. In this work, we study approximatematching dependencies that generalize FDs by relaxing the constraints on the attributes, i.e. agreement is based on a similarity relation rather than on equality. Such dependencies are attracting attention in the database field since they allow uncrisping the basic notion of FDs extending its application to many different fields, such as data quality, data mining, behavior analysis, data cleaning or data partition, among others. We show that these dependencies can be formalized in the framework of Formal Concept Analysis (FCA) using a previous formalization introduced for standard FDs. Our new results state that, starting from the conceptual structure of a pattern structure, and generalizing the notion of relation between tuples, approximate-matching dependencies can be characterized as implications in a pattern concept lattice. We finally show how to use basic FCA algorithms to construct a pattern concept lattice that entails these dependencies after a slight and tractable binarization of the original data.Postprint (author's final draft
Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing
complexity and size of modern datasets. Constraint based data cleaning
techniques rely on integrity constraints as a benchmark to identify and correct
errors. Data values that do not satisfy the given set of constraints are
flagged as dirty, and data updates are made to re-align the data and the
constraints. However, many errors often require user input to resolve due to
domain expertise defining specific terminology and relationships. For example,
in pharmaceuticals, 'Advil' \emph{is-a} brand name for 'ibuprofen' that can be
captured in a pharmaceutical ontology. While functional dependencies (FDs) have
traditionally been used in existing data cleaning solutions to model syntactic
equivalence, they are not able to model broader relationships (e.g., is-a)
defined by an ontology. In this paper, we take a first step towards extending
the set of data quality constraints used in data cleaning by defining and
discovering \emph{Ontology Functional Dependencies} (OFDs). We lay out
theoretical and practical foundations for OFDs, including a set of sound and
complete axioms, and a linear inference procedure. We then develop effective
algorithms for discovering OFDs, and a set of optimizations that efficiently
prune the search space. Our experimental evaluation using real data show the
scalability and accuracy of our algorithms.Comment: 12 page
- …