Towards Certain Fixes with Editing Rules and Master Data
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
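The mechanics of an editing rule can be pictured with a small sketch. The Python below is only an illustration under assumed names (the apply_editing_rule function, the zip/city/state attributes and the master table contents are all hypothetical, not the paper's syntax): it matches an input tuple against master data on an attribute inside the certain region and, on a match, overwrites the attributes the rule designates as fixable.

    # A minimal sketch of applying one editing rule with master data.
    # Hypothetical rule: if the input tuple agrees with a master tuple on
    # "zip" (an attribute in the certain region), take "city" and "state"
    # from the master tuple.

    master = [
        {"zip": "EH8 9AB", "city": "Edinburgh", "state": "UK"},
        {"zip": "10001",   "city": "New York",  "state": "NY"},
    ]

    def apply_editing_rule(t, master, match_attrs, fix_attrs, certain_region):
        """Fix t[fix_attrs] from a master tuple agreeing with t on match_attrs.

        The match attributes must lie in the certain region, i.e. be
        attributes the user has asserted correct, so that the fix itself
        is warranted to be correct.
        """
        if not set(match_attrs) <= certain_region:
            return t, False  # not applicable: match attrs not assured correct
        for s in master:
            if all(t[a] == s[a] for a in match_attrs):
                for b in fix_attrs:
                    t[b] = s[b]            # overwrite with the master value
                    certain_region.add(b)  # fixed attributes become certain
                return t, True
        return t, False

    t = {"zip": "10001", "city": "New Yrok", "state": "NJ"}
    t, fixed = apply_editing_rule(t, master, ["zip"], ["city", "state"], {"zip"})
    print(t, fixed)  # {'zip': '10001', 'city': 'New York', 'state': 'NY'} True

Note how attributes fixed this way are added to the certain region, which is what allows a chain of rules to extend a small assured region until, ideally, the whole tuple is fixed.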
Advanced Ramsey-Based Büchi Automata Inclusion Testing
Checking language inclusion between two nondeterministic Büchi automata A and B is computationally hard (PSPACE-complete). However, several approaches which are efficient in many practical cases have been proposed. We build on one of these, known as the Ramsey-based approach. It has recently been shown that the basic Ramsey-based approach can be drastically optimized by using powerful subsumption techniques, which allow one to prune the search space when looking for counterexamples to inclusion. While previous works only used subsumption based on set inclusion or forward simulation on A and B, we propose the following new techniques: (1) a larger subsumption relation based on a combination of backward and forward simulations on A and B; (2) a method to additionally use forward simulation between A and B; (3) abstraction techniques that can speed up the computation and lead to early detection of counterexamples. The new algorithm was implemented and tested on automata derived from real-world model checking benchmarks, and on the Tabakov-Vardi random model, thus showing the usefulness of the proposed techniques.
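The pruning idea is easiest to see outside the Büchi setting. The sketch below checks inclusion for NFAs on finite words with an antichain-style subsumption test; it illustrates the generic technique only, not the Ramsey-based construction, and all names in it are our own.

    from collections import deque

    def nfa_inclusion(A, B):
        """Check L(A) <= L(B) for NFAs on finite words.

        A, B: dicts with 'init' (set of states), 'final' (set),
        'alphabet' (set of symbols) and 'delta' (dict mapping
        (state, symbol) -> set of successor states).
        """
        def b_post(S, sym):
            # All B-states reachable from S in one step on sym.
            return frozenset(q for p in S for q in B['delta'].get((p, sym), ()))

        frontier = deque()
        processed = set()

        def push(node):
            a, S = node
            # Subsumption: if some explored node (a, S') has S' <= S, discard
            # this one; the smaller B-side set reaches a counterexample
            # whenever the larger one does, so nothing is lost.
            if any(a2 == a and S2 <= S for (a2, S2) in processed):
                return
            processed.add(node)
            frontier.append(node)

        for a in A['init']:
            push((a, frozenset(B['init'])))
        while frontier:
            a, S = frontier.popleft()
            if a in A['final'] and not (S & B['final']):
                return False  # a word accepted by A but by no run of B
            for sym in A['alphabet']:
                S2 = b_post(S, sym)
                for a2 in A['delta'].get((a, sym), ()):
                    push((a2, S2))
        return True

    # Tiny demo: A accepts 'ab', B accepts only 'aa', so inclusion fails.
    A = {'init': {0}, 'final': {2}, 'alphabet': {'a', 'b'},
         'delta': {(0, 'a'): {1}, (1, 'b'): {2}}}
    B = {'init': {0}, 'final': {2}, 'alphabet': {'a', 'b'},
         'delta': {(0, 'a'): {1}, (1, 'a'): {2}}}
    print(nfa_inclusion(A, B))  # False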
Improving data quality: data consistency, deduplication, currency and accuracy
Data quality is one of the key problems in data management. An unprecedented amount of data has been accumulated and has become a valuable asset of an organization. The value of the data relies greatly on its quality. However, data is often dirty in real life: it may be inconsistent, duplicated, stale, inaccurate or incomplete, which can reduce its usability and increase the cost of businesses. Consequently, the need to improve data quality arises, comprising five central issues: data consistency, data deduplication, data currency, data accuracy and information completeness. This thesis presents the results of our work on the first four of these issues: data consistency, deduplication, currency and accuracy.

The first part of the thesis investigates incremental verification of data consistency in distributed data. Given a distributed database D, a set S of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, the changes ΔV to V in response to ΔD. Although the problems are intractable, we show that they are bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. Such incremental algorithms are provided for both vertically and horizontally partitioned data, and we show that the algorithms are optimal (a toy illustration of this bounded behaviour is sketched below).

The second part of the thesis studies the interaction between record matching and data repairing. Record matching, the main technique underlying data deduplication, aims to identify tuples that refer to the same real-world object; repairing is to make a database consistent by fixing errors in the data using constraints. These are treated as separate processes in most data cleaning systems, based on heuristic solutions. However, our studies show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations to clean a database based on integrity constraints, matching rules and master data.

The third part of the thesis presents our study of finding certain fixes that are absolutely correct for data repairing. Data repairing methods based on integrity constraints are normally heuristic: they may not find certain fixes, and worse still, they may even introduce new errors when attempting to repair the data. This is particularly problematic for critical data such as medical records, in which a seemingly minor error can have disastrous consequences. We propose a framework and an algorithm to find certain fixes, based on master data, a class of editing rules and user interactions. A prototype system is also developed.

The fourth part of the thesis introduces inferring data currency and consistency for conflict resolution, where data currency aims to identify the current values of entities, and conflict resolution is to combine tuples that pertain to the same real-world entity into a single tuple, resolving conflicts; this is also an important issue for data deduplication. We show that data currency and consistency help each other in resolving conflicts. We study a number of associated fundamental problems, and develop an approach for conflict resolution by inferring data currency and consistency.
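As a toy illustration of the bounded behaviour claimed in the first part, the sketch below maintains a violation index for a single variable CFD (zip determines city) on one machine; the attribute names and the insert API are assumptions made for the example, and distribution, deletions and general CFDs are all omitted.

    from collections import defaultdict

    # zip -> {city: set of tuple ids seen with that (zip, city) combination}
    index = defaultdict(dict)

    def insert(tid, zip_code, city):
        """Apply one insertion from delta-D; return the new violations delta-V.

        Only the index entry for zip_code is touched, so the work depends on
        the size of the update and of delta-V, not on the size of the database.
        """
        groups = index[zip_code]
        # Every earlier tuple with the same zip but a different city now
        # violates the CFD together with the inserted tuple.
        delta_v = [(tid, other) for c, tids in groups.items() if c != city
                   for other in tids]
        groups.setdefault(city, set()).add(tid)
        return delta_v

    print(insert(1, "EH8", "Edinburgh"))  # []
    print(insert(2, "EH8", "Glasgow"))    # [(2, 1)]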
The last part of the thesis reports our study of data accuracy, centering on the longstanding relative accuracy problem: given tuples t1 and t2 that refer to the same entity e, determine whether t1[A] is more accurate than t2[A], i.e., whether t1[A] is closer to the true value of the A attribute of e than t2[A] is. We introduce a class of accuracy rules and an inference system with a chase procedure to deduce relative accuracy, and we study the related fundamental problems. We also propose a framework and algorithms for inferring accurate values with user interaction.
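To make the deduction mechanism concrete, here is a toy chase-style inference with an illustrative rule (the rule encoding, the 'updated' attribute and the chase function are our assumptions, not the thesis's formalism): a rule asserts that one tuple is at least as accurate as another on an attribute, and deduced facts are closed under transitivity.

    from itertools import product

    def chase(tuples, rules, attrs):
        """Return, per attribute, the deduced pairs (i, j) meaning i >= j."""
        order = {a: set() for a in attrs}
        changed = True
        while changed:
            changed = False
            # Apply every rule to every ordered pair of distinct tuples.
            for (i, t1), (j, t2) in product(enumerate(tuples), repeat=2):
                if i == j:
                    continue
                for attr, cond in rules:
                    if cond(t1, t2) and (i, j) not in order[attr]:
                        order[attr].add((i, j)); changed = True
            # Transitivity: i >= j and j >= k imply i >= k.
            for a in attrs:
                for (i, j), (j2, k) in product(list(order[a]), repeat=2):
                    if j == j2 and (i, k) not in order[a]:
                        order[a].add((i, k)); changed = True
        return order

    # Illustrative rule: a tuple with a later 'updated' stamp is at least
    # as accurate on 'phone'.
    rules = [("phone", lambda t1, t2: t1["updated"] > t2["updated"])]
    tuples = [{"phone": "555-0101", "updated": 2010},
              {"phone": "555-0199", "updated": 2012}]
    print(chase(tuples, rules, ["phone"]))  # {'phone': {(1, 0)}}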
Extending dependencies for improving data quality
This doctoral thesis presents the results of my work on extending dependencies for
improving data quality, both in a centralized environment with a single database and
in a data exchange and integration environment with multiple databases.
The first part of the thesis proposes five classes of data dependencies, referred to as
CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly
found in practice in a centralized environment. For each class of these dependencies,
we investigate two central problems: the satisfiability problem and the implication
problem. The satisfiability problem is to determine, given a set Σ of dependencies
defined on a database schema R, whether or not there exists a nonempty database D of
R that satisfies Σ. The implication problem is to determine whether or not a set Σ
of dependencies defined on a database schema R entails another dependency φ on R;
that is, whether each database D of R that satisfies Σ also satisfies φ. These are
important for the validation and optimization of data-cleaning processes. We establish
complexity results of the satisfiability problem and the implication problem for all
these five classes of dependencies, both in the absence of finite-domain attributes and in
the general setting with finite-domain attributes. Moreover, SQL-based techniques are
developed to detect data inconsistencies for each class of the proposed dependencies,
which can easily be implemented on top of current database management systems.
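For intuition, the satisfiability problem defined above can be illustrated with a brute-force check, restricted here to constant CFDs over finite-domain attributes (the dict-based rule encoding is our own, and the exponential enumeration is deliberate, since the problem is intractable in general). It exploits the fact that removing tuples never creates CFD violations, so a set of CFDs is satisfiable iff some single tuple satisfies every rule.

    from itertools import product

    # A constant CFD as (condition, consequence), both pattern dicts over
    # attributes. Example: if country = 'UK' then area_code = '44'.
    cfds = [
        ({"country": "UK"}, {"area_code": "44"}),
        ({"country": "UK"}, {"area_code": "01"}),  # clashes with the rule above
    ]

    domains = {"country": ["UK", "US"], "area_code": ["44", "01"]}

    def tuple_satisfies(t, cfd):
        cond, cons = cfd
        if all(t[a] == v for a, v in cond.items()):
            return all(t[a] == v for a, v in cons.items())
        return True  # condition not matched: the rule imposes nothing on t

    def satisfiable(cfds, domains):
        attrs = list(domains)
        # Enumerate all single tuples over the finite domains.
        for values in product(*(domains[a] for a in attrs)):
            t = dict(zip(attrs, values))
            if all(tuple_satisfies(t, c) for c in cfds):
                return True
        return False

    print(satisfiable(cfds, domains))  # True: a tuple with country = 'US' works

Note that the two UK rules clash only on tuples with country = 'UK', so the set is still satisfiable; dropping the US value from the domain would make it unsatisfiable.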
The second part of the thesis studies three important topics for data cleaning in a
data exchange and integration environment with multiple databases.
One is the dependency propagation problem, which is to determine, given a view
defined on data sources and a set of dependencies on the sources, whether another
dependency is guaranteed to hold on the view. We investigate dependency propagation
for views defined in various fragments of relational algebra, with conditional functional
dependencies (CFDs) [FGJK08] as view dependencies and with source dependencies
given as either CFDs or traditional functional dependencies (FDs). We establish
lower and upper bounds, all matching, ranging from PTIME to undecidable. These not
only provide the first results for CFD propagation, but also extend the classical work
of FD propagation by giving new complexity bounds in the presence of a setting with
finite domains. We finally provide the first algorithm for computing a minimal cover of
all CFDs propagated via SPC views. The algorithm has the same complexity as one of
the most efficient algorithms for computing a cover of FDs propagated via a projection
view, despite the increased expressive power of CFDs and SPC views. Another one is matching records from unreliable data sources. A class of matching
Another one is matching records from unreliable data sources. A class of matching
dependencies (MDs) is introduced for specifying the semantics of unreliable data. As
opposed to static constraints for schema design such as FDs, MDs are developed for
record matching, and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs),
to determine what attributes to compare and how to compare them when matching
records across possibly different relations. We also propose a mechanism for inferring MDs with a sound and complete system, a departure from traditional implication
analysis, such that when we cannot match records by comparing attributes that contain
errors, we may still find matches by using other, more reliable attributes. We finally
provide a quadratic time algorithm for inferring MDs, and an effective algorithm for
deducing quality RCKs from a given set of MDs.
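A single MD can be pictured as follows; the similarity predicate, threshold and attribute names are illustrative assumptions rather than the thesis's syntax. The rule reads: if two records have similar names and identical phone numbers, identify them as referring to the same entity.

    from difflib import SequenceMatcher

    def similar(x, y, threshold=0.8):
        # An illustrative string-similarity metric.
        return SequenceMatcher(None, x, y).ratio() >= threshold

    def md_matches(r1, r2):
        """MD: name ~ name AND phone = phone  =>  r1 and r2 are one entity."""
        return similar(r1["name"], r2["name"]) and r1["phone"] == r2["phone"]

    r1 = {"name": "Robert W. Smith", "phone": "555-0101", "addr": "10 High St"}
    r2 = {"name": "Robert Smith",    "phone": "555-0101", "addr": "10 High Str"}
    print(md_matches(r1, r2))  # True: the records are identified as duplicates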
The last one is finding certain fixes for data monitoring [CGGM03, SMO07], which
is to find and correct errors in a tuple when it is created, either entered manually or
generated by some process. That is, we want to ensure that a tuple t is clean before it
is used, to prevent errors introduced by adding t. As noted by [SMO07], it is far less
costly to correct a tuple at the point of entry than to fix it afterward.
Data repairing based on integrity constraints may not find certain fixes that are
absolutely correct, and worse, may introduce new errors when repairing the data. We
propose a method for finding certain fixes, based on master data, a notion of certain
regions, and a class of editing rules. A certain region is a set of attributes that are
assured correct by the users. Given a certain region and master data, editing rules tell
us what attributes to fix and how to update them. We show how the method can be used
in data monitoring and enrichment. We develop techniques for reasoning about editing
rules, to decide whether they lead to a unique fix and whether they are able to fix all
the attributes in a tuple, relative to master data and a certain region. We also provide
an algorithm to identify minimal certain regions, such that a certain fix is warranted by
editing rules and master data as long as one of the regions is correct.