Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing
complexity and size of modern datasets. Constraint based data cleaning
techniques rely on integrity constraints as a benchmark to identify and correct
errors. Data values that do not satisfy the given set of constraints are
flagged as dirty, and data updates are made to re-align the data and the
constraints. However, many errors often require user input to resolve due to
domain expertise defining specific terminology and relationships. For example,
in pharmaceuticals, 'Advil' \emph{is-a} brand name for 'ibuprofen' that can be
captured in a pharmaceutical ontology. While functional dependencies (FDs) have
traditionally been used in existing data cleaning solutions to model syntactic
equivalence, they are not able to model broader relationships (e.g., is-a)
defined by an ontology. In this paper, we take a first step towards extending
the set of data quality constraints used in data cleaning by defining and
discovering \emph{Ontology Functional Dependencies} (OFDs). We lay out
theoretical and practical foundations for OFDs, including a set of sound and
complete axioms, and a linear inference procedure. We then develop effective
algorithms for discovering OFDs, and a set of optimizations that efficiently
prune the search space. Our experimental evaluation using real data shows the
scalability and accuracy of our algorithms.
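The core idea can be illustrated with a minimal sketch (this is not the paper's discovery algorithm; the tiny ontology, the function names, and the sample rows below are all assumptions): a syntactic FD check treats 'Advil' and 'ibuprofen' as conflicting values, while an ontology-aware check resolves both to the same concept before comparing.

```python
# Minimal sketch of ontology-based equivalence for a dependency X -> Y.
# The toy ontology maps each term to a canonical concept (is-a / synonym).
ONTOLOGY = {
    "Advil": "ibuprofen",          # 'Advil' is-a brand name for 'ibuprofen'
    "Motrin": "ibuprofen",
    "ibuprofen": "ibuprofen",
    "Tylenol": "acetaminophen",
    "acetaminophen": "acetaminophen",
}

def canonical(value):
    """Resolve a value to its ontology concept; unknown values map to themselves."""
    return ONTOLOGY.get(value, value)

def satisfies_ofd(tuples, lhs, rhs):
    """Check X -> Y where rhs values are compared by concept, not by syntax."""
    seen = {}
    for t in tuples:
        key = tuple(t[a] for a in lhs)
        concept = canonical(t[rhs])
        if key in seen and seen[key] != concept:
            return False               # same X, semantically different Y
        seen[key] = concept
    return True

rows = [
    {"symptom": "headache", "drug": "Advil"},
    {"symptom": "headache", "drug": "ibuprofen"},  # violates the syntactic FD
]
print(satisfies_ofd(rows, ["symptom"], "drug"))    # True: both map to 'ibuprofen'
```

The same two rows would fail a classical FD check, which is exactly the gap OFDs are meant to close.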
Towards Certain Fixes with Editing Rules and Master Data
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
Data Doctor: An Efficient Data Profiling and Quality Improvement Tool
Many business and IT managers face the same problem: the data that serves as the foundation for their business applications is inconsistent, inaccurate, and unreliable. Data profiling is the solution to this problem and, as such, is a fundamental step that should begin every data-driven initiative. In this paper we implement data profiling techniques such as Column Analysis, Frequency Analysis, Null Rule Analysis, Constant Analysis, Empty Column Analysis, and Unique Analysis.
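The column-level checks named above can be sketched in a few lines (a minimal illustration only; the function name and result keys are assumptions, not the tool's actual API):

```python
# Sketch of per-column profiling: column, frequency, null-rule, constant,
# empty-column, and unique analysis over one column of values.
from collections import Counter

def profile_column(values):
    non_null = [v for v in values if v is not None and v != ""]
    freq = Counter(non_null)
    return {
        "count": len(values),                                   # column analysis
        "frequency": dict(freq),                                # frequency analysis
        "null_ratio": 1 - len(non_null) / len(values) if values else 0.0,  # null rule
        "is_constant": len(freq) == 1,                          # constant analysis
        "is_empty": len(non_null) == 0,                         # empty column analysis
        "is_unique": len(freq) == len(non_null) and non_null != [],  # unique analysis
    }

print(profile_column(["a", "b", "a", None]))
```

A real profiler would run this per column across the whole table and compare the results against expected rules (e.g., a key column should report `is_unique`).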
DOI: 10.17762/ijritcc2321-8169.160411
Towards a catalog of spreadsheet smells
Spreadsheets are considered to be the most widely used programming language in the world, and reports have shown that 90% of real-world spreadsheets contain errors. In this work, we try to identify spreadsheet smells, a concept adapted from software engineering, which consists of a surface indication that usually corresponds to a deeper problem. Our smells have been integrated in a tool, and were computed for a large spreadsheet repository. Finally, the analysis of the results we obtained led to the refinement of our initial catalog.
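As a hypothetical illustration of what such a detector might flag (the abstract does not list its smells, so the two checks below, hard-coded constants and overly long formulas, are commonly cited examples chosen here as assumptions):

```python
# Illustrative smell checks over a single spreadsheet formula string.
import re

def formula_smells(formula):
    """Return the names of illustrative smells detected in one formula."""
    found = []
    body = re.sub(r"\$?[A-Z]+\$?\d+", "", formula)  # drop cell refs like B2, $A$1
    if re.search(r"\d", body):
        found.append("magic number")    # hard-coded constant buried in a formula
    if len(formula) > 60:
        found.append("long formula")    # overly complex expression
    return found

print(formula_smells("=B2*1.21"))   # ['magic number']: 1.21 is hard-coded
print(formula_smells("=B2*C2"))     # []
```

As with code smells, a hit is only a surface indication; the deeper problem (here, an untracked tax rate that should live in its own cell) still needs a human judgment.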
Approximation Measures for Conditional Functional Dependencies Using Stripped Conditional Partitions
Conditional functional dependencies (CFDs) have been used to improve the quality of data, including detecting and repairing data inconsistencies. Approximation measures have significant importance for data dependencies in data mining. To adapt to exceptions in real data, the measures are used to relax the strictness of CFDs into more generalized dependencies, called approximate conditional functional dependencies (ACFDs). This paper analyzes the weaknesses of the dependency degree, confidence, and conviction measures for general CFDs (constant and variable CFDs). A new measure for general CFDs based on incomplete knowledge granularity is proposed to measure the approximation of these dependencies as well as the distribution of data tuples into the conditional equivalence classes. Finally, the effectiveness of stripped conditional partitions and this new measure are evaluated on synthetic and real data sets. These results are important to the theory of approximate dependencies and to improving discovery algorithms for CFDs and ACFDs.