Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing
complexity and size of modern datasets. Constraint based data cleaning
techniques rely on integrity constraints as a benchmark to identify and correct
errors. Data values that do not satisfy the given set of constraints are
flagged as dirty, and data updates are made to re-align the data and the
constraints. However, many errors often require user input to resolve due to
domain expertise defining specific terminology and relationships. For example,
in pharmaceuticals, 'Advil' \emph{is-a} brand name for 'ibuprofen' that can be
captured in a pharmaceutical ontology. While functional dependencies (FDs) have
traditionally been used in existing data cleaning solutions to model syntactic
equivalence, they are not able to model broader relationships (e.g., is-a)
defined by an ontology. In this paper, we take a first step towards extending
the set of data quality constraints used in data cleaning by defining and
discovering \emph{Ontology Functional Dependencies} (OFDs). We lay out
theoretical and practical foundations for OFDs, including a set of sound and
complete axioms, and a linear inference procedure. We then develop effective
algorithms for discovering OFDs, and a set of optimizations that efficiently
prune the search space. Our experimental evaluation using real data shows the
scalability and accuracy of our algorithms.
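The core idea can be illustrated with a minimal sketch (this is not the paper's discovery algorithm; the tiny ontology, the function names, and the sample rows below are all assumptions): a syntactic FD check treats 'Advil' and 'ibuprofen' as conflicting values, while an ontology-aware check resolves both to the same concept before comparing.

```python
# Minimal sketch of ontology-based equivalence for a dependency X -> Y.
# The toy ontology maps each term to a canonical concept (is-a / synonym).
ONTOLOGY = {
    "Advil": "ibuprofen",          # 'Advil' is-a brand name for 'ibuprofen'
    "Motrin": "ibuprofen",
    "ibuprofen": "ibuprofen",
    "Tylenol": "acetaminophen",
    "acetaminophen": "acetaminophen",
}

def canonical(value):
    """Resolve a value to its ontology concept; unknown values map to themselves."""
    return ONTOLOGY.get(value, value)

def satisfies_ofd(tuples, lhs, rhs):
    """Check X -> Y where rhs values are compared by concept, not by syntax."""
    seen = {}
    for t in tuples:
        key = tuple(t[a] for a in lhs)
        concept = canonical(t[rhs])
        if key in seen and seen[key] != concept:
            return False               # same X, semantically different Y
        seen[key] = concept
    return True

rows = [
    {"symptom": "headache", "drug": "Advil"},
    {"symptom": "headache", "drug": "ibuprofen"},  # violates the syntactic FD
]
print(satisfies_ofd(rows, ["symptom"], "drug"))    # True: both map to 'ibuprofen'
```

The same two rows would fail a classical FD check, which is exactly the gap OFDs are meant to close.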
Towards Certain Fixes with Editing Rules and Master Data
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
Data Doctor: An Efficient Data Profiling and Quality Improvement Tool
Many business and IT managers face the same problem: the data that serves as the foundation for their business applications is inconsistent, inaccurate, and unreliable. Data profiling is the solution to this problem and, as such, is a fundamental step that should begin every data-driven initiative. In this paper we implement data profiling techniques such as Column Analysis, Frequency Analysis, Null Rule Analysis, Constant Analysis, Empty Column Analysis, and Unique Analysis.
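The column-level checks named above can be sketched in a few lines (a minimal illustration only; the function name and result keys are assumptions, not the tool's actual API):

```python
# Sketch of per-column profiling: column, frequency, null-rule, constant,
# empty-column, and unique analysis over one column of values.
from collections import Counter

def profile_column(values):
    non_null = [v for v in values if v is not None and v != ""]
    freq = Counter(non_null)
    return {
        "count": len(values),                                   # column analysis
        "frequency": dict(freq),                                # frequency analysis
        "null_ratio": 1 - len(non_null) / len(values) if values else 0.0,  # null rule
        "is_constant": len(freq) == 1,                          # constant analysis
        "is_empty": len(non_null) == 0,                         # empty column analysis
        "is_unique": len(freq) == len(non_null) and non_null != [],  # unique analysis
    }

print(profile_column(["a", "b", "a", None]))
```

A real profiler would run this per column across the whole table and compare the results against expected rules (e.g., a key column should report `is_unique`).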
DOI: 10.17762/ijritcc2321-8169.160411
Towards a catalog of spreadsheet smells
Spreadsheets are considered to be the most widely used programming language in the world, and reports have shown that 90% of real-world spreadsheets contain errors. In this work, we try to identify spreadsheet smells, a concept adapted from software engineering, which consists of a surface indication that usually corresponds to a deeper problem. Our smells have been integrated in a tool, and were computed for a large spreadsheet repository. Finally, the analysis of the results we obtained led to the refinement of our initial catalog.
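As a hypothetical illustration of what such a detector might flag (the abstract does not list its smells, so the two checks below, hard-coded constants and overly long formulas, are commonly cited examples chosen here as assumptions):

```python
# Illustrative smell checks over a single spreadsheet formula string.
import re

def formula_smells(formula):
    """Return the names of illustrative smells detected in one formula."""
    found = []
    body = re.sub(r"\$?[A-Z]+\$?\d+", "", formula)  # drop cell refs like B2, $A$1
    if re.search(r"\d", body):
        found.append("magic number")    # hard-coded constant buried in a formula
    if len(formula) > 60:
        found.append("long formula")    # overly complex expression
    return found

print(formula_smells("=B2*1.21"))   # ['magic number']: 1.21 is hard-coded
print(formula_smells("=B2*C2"))     # []
```

As with code smells, a hit is only a surface indication; the deeper problem (here, an untracked tax rate that should live in its own cell) still needs a human judgment.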
Approximation Measures for Conditional Functional Dependencies Using Stripped Conditional Partitions
Conditional functional dependencies (CFDs) have been used to improve the quality of data, including detecting and repairing data inconsistencies. Approximation measures have significant importance for data dependencies in data mining. To adapt to exceptions in real data, the measures are used to relax the strictness of CFDs into more generalized dependencies, called approximate conditional functional dependencies (ACFDs). This paper analyzes the weaknesses of the dependency degree, confidence, and conviction measures for general CFDs (constant and variable CFDs). A new measure for general CFDs based on incomplete knowledge granularity is proposed to measure the approximation of these dependencies as well as the distribution of data tuples into the conditional equivalence classes. Finally, the effectiveness of stripped conditional partitions and this new measure are evaluated on synthetic and real data sets. These results are important to the theory of approximate dependencies and to improving discovery algorithms for CFDs and ACFDs.