26 research outputs found
A revival of integrity constraints for data cleaning
Integrity constraints, a.k.a. data dependencies, are being widely used for improving the quality of schema. Recently constraints have enjoyed a revival for improving the quality of data. The tutorial aims to provide an overview of recent advances in constraint-based data cleaning.
SMOQE: A System for Providing Secure Access to XML
XML views have been widely used to enforce access control, support data integration, and speed up query answering. In many applications, e.g., XML security enforcement, it is prohibitively expensive to materialize and maintain a large number of views. Therefore, views are necessarily virtual. An immediate question then is how to answer queries on XML virtual views. A common approach is to rewrite a query on the view to an equivalent one on the underlying document, and evaluate the rewritten query. This is the approach used in the Secure MOdular Query Engine (SMOQE). The demo presents SMOQE, the first system to provide efficient support for answering queries over virtual and possibly recursively defined XML views. We demonstrate a set of novel techniques for the specification of views, the rewriting, evaluation and optimization of XML queries. Moreover, we provide insights into the internals of the engine by a set of visual tools.
Reasoning about Record Matching Rules
To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n^2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.
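As a rough illustration of how a relative candidate key might drive record matching, the sketch below compares two records attribute by attribute, using equality for some attributes and a similarity test for others. Everything in it is an assumption made for illustration: the attribute names, the sample records, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity metric are not taken from the paper, and the paper's inference and deduction algorithms are not shown.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical similarity operator: case-insensitive ratio above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def equal(a, b):
    """Exact equality operator."""
    return a == b


# Hypothetical key in the spirit of an RCK: each entry names an attribute of
# the left relation, the corresponding attribute of the right relation, and
# the operator used to compare them.
SAMPLE_KEY = [
    ("last_name", "surname", equal),
    ("address", "post_addr", equal),
    ("first_name", "given_name", similar),
]

# Made-up records from two differently named schemas.
CREDIT_ROW = {"first_name": "Jonathan", "last_name": "Clifford",
              "address": "10 Oak Street"}
BILLING_ROW = {"given_name": "Jonathon", "surname": "Clifford",
               "post_addr": "10 Oak Street"}


def matches(left, right, key):
    """Two records match if every comparison listed in the key holds."""
    return all(op(left[a], right[b]) for a, b, op in key)


if __name__ == "__main__":
    print(matches(CREDIT_ROW, BILLING_ROW, SAMPLE_KEY))
```

The key is plain data, so a matching tool could swap in a different key that relies on other, more reliable attributes without changing `matches`.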
From Relations to XML: Cleaning, Integrating and Securing Data
While relational databases are still the preferred approach for storing data, XML is emerging
as the primary standard for representing and exchanging data. Consequently, it has
become increasingly important to provide a uniform XML interface to various data sources
(integration), and critical to protect sensitive and confidential information in XML data
(access control). Moreover, it is preferable to first detect and repair the inconsistencies in
the data to avoid the propagation of errors to other data processing steps. In response to
these challenges, this thesis presents an integrated framework for cleaning, integrating and
securing data.
The framework contains three parts. First, the data cleaning sub-framework makes
use of a new class of constraints specially designed for improving data quality, referred
to as conditional functional dependencies (CFDs), to detect and remove inconsistencies in
relational data. Both batch and incremental techniques are developed for efficiently detecting
CFD violations in SQL and for repairing them based on a cost model. The cleaned relational
data, together with other non-XML data, is then converted to XML format by using
widely deployed XML publishing facilities. Second, the data integration sub-framework
uses a novel formalism, XML integration grammars (XIGs), to integrate multi-source XML
data which is either native or published from traditional databases. XIGs automatically
support conformance to a target DTD, and allow one to build a large, complex integration
via composition of component XIGs. To efficiently materialize the integrated data, algorithms
are developed for merging XML queries in XIGs and for scheduling them. Third, to
protect sensitive information in the integrated XML data, the data security sub-framework
allows users to access the data only through authorized views. User queries posed on these
views need to be rewritten into equivalent queries on the underlying document to avoid the
prohibitive cost of materializing and maintaining a large number of views. Two algorithms
are proposed to support virtual XML views: a rewriting algorithm that characterizes the
rewritten queries as a new form of automata and an evaluation algorithm to execute the
automata-represented queries. They allow the security sub-framework to answer queries
on views in linear time.
Using both relational and XML technologies, this framework provides a uniform approach
to clean, integrate and secure data. The algorithms and techniques in the framework
have been implemented, and the experimental study verifies their effectiveness and efficiency.
Rewriting Regular XPath Queries on XML Views
We study the problem of answering queries posed on virtual views of XML documents, a problem commonly encountered when enforcing XML access control and integrating data. We approach the problem by rewriting queries on views into equivalent queries on the underlying document, and thus avoid the overhead of view materialization and maintenance. We consider possibly recursively defined XML views and study the rewriting of both XPath and regular XPath queries. We show that while rewriting is not always possible for XPath over recursive views, it is for regular XPath; however, the rewritten query may be of exponential size. To avoid this prohibitive cost we propose a rewriting algorithm that characterizes rewritten queries as a new form of automata, and an efficient algorithm to evaluate the automaton-represented queries. These allow us to answer queries on views in linear time. We have fully implemented a prototype system, SMOQE, which yields the first regular XPath engine and a practical solution for answering queries over possibly recursively defined XML views.
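To give a feel for why an automaton representation can be evaluated in one pass over the document, the sketch below runs a plain nondeterministic finite automaton over element labels while walking a tree top-down. It is a simplified stand-in, not the new form of automata or the evaluation algorithm from the paper: the `Node` class, the sample document, and the automaton for the regular path `(section)*/title` are all assumptions made for illustration, and filters (qualifiers) are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal XML element: a label, some text, and children."""
    label: str
    text: str = ""
    children: list = field(default_factory=list)


def evaluate(root, transitions, start, accepting):
    """Return nodes reached in an accepting state, in document order.

    transitions maps (state, child_label) to a set of next states.
    Each node is visited at most once, so for a fixed automaton the work
    grows linearly with the size of the tree.
    """
    selected = []
    stack = [(root, {start})]
    while stack:
        node, states = stack.pop()
        if states & accepting:
            selected.append(node)
        # Push children in reverse so they are popped in document order.
        for child in reversed(node.children):
            next_states = set()
            for state in states:
                next_states |= transitions.get((state, child.label), set())
            if next_states:  # prune subtrees no automaton run can enter
                stack.append((child, next_states))
    return selected


# Assumed automaton for (section)*/title, evaluated from the root:
# state 0 loops on "section"; reading "title" moves to accepting state 1.
TRANSITIONS = {(0, "section"): {0}, (0, "title"): {1}}

SAMPLE_DOC = Node("doc", children=[
    Node("title", "Top"),
    Node("section", children=[
        Node("title", "S1"),
        Node("section", children=[
            Node("title", "S1.1"),
            Node("para", children=[Node("title", "inside para")]),
        ]),
    ]),
    Node("para", children=[Node("title", "P")]),
])

if __name__ == "__main__":
    for hit in evaluate(SAMPLE_DOC, TRANSITIONS, 0, {1}):
        print(hit.text)
```

Titles nested under `para` are never reached because no transition consumes the `para` label, so those subtrees are skipped without being traversed.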
Conditional Functional Dependencies for Data Cleaning
We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong’s axioms for FDs, as well as consistency analysis. Since CFDs allow data bindings, a large number of individual constraints may hold on a table, complicating detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints in a single query. We experimentally evaluate the performance of our CFD-based methods for inconsistency detection. This not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.
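A minimal sketch of what SQL-based violation detection could look like for two hand-written CFDs, run through Python's standard `sqlite3` module. The `cust` table, its columns, the sample rows, and both constraints are assumptions made for illustration. The queries are hard-coded per constraint and do not reproduce the paper's technique for checking multiple constraints in a single query.

```python
import sqlite3

# Made-up customer rows: (id, country code, area code, zip, street, city).
SAMPLE_ROWS = [
    (1, "44", "131", "EH4 1DT", "Mayfield", "EDI"),
    (2, "44", "131", "EH4 1DT", "Crichton", "EDI"),
    (3, "01", "908", "07974", "Mtn Ave", "NYC"),
    (4, "01", "908", "07974", "Mtn Ave", "MH"),
    (5, "01", "212", "10001", "Main St", "NYC"),
    (6, "44", "131", "EH8 9AB", "High St", "EDI"),
]


def find_cfd_violations(rows):
    """Check two illustrative CFDs and return (constant_hits, variable_hits)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cust (id INTEGER PRIMARY KEY, cc TEXT, ac TEXT, "
        "zip TEXT, street TEXT, city TEXT)"
    )
    conn.executemany("INSERT INTO cust VALUES (?, ?, ?, ?, ?, ?)", rows)

    # Assumed CFD 1 (constant pattern): if cc = '01' and ac = '908',
    # then city must be 'MH'. A single tuple can violate it on its own.
    constant_hits = [row[0] for row in conn.execute(
        "SELECT id FROM cust "
        "WHERE cc = '01' AND ac = '908' AND city <> 'MH' ORDER BY id"
    )]

    # Assumed CFD 2 (variable pattern): among tuples with cc = '44',
    # zip determines street. A violation needs two tuples that agree on
    # zip but disagree on street, so we group and count distinct streets.
    variable_hits = [row[0] for row in conn.execute(
        "SELECT zip FROM cust WHERE cc = '44' "
        "GROUP BY zip HAVING COUNT(DISTINCT street) > 1 ORDER BY zip"
    )]

    conn.close()
    return constant_hits, variable_hits


if __name__ == "__main__":
    print(find_cfd_violations(SAMPLE_ROWS))
```

The two queries show the contrast the abstract draws with traditional FDs: the condition on `cc` restricts the dependency to a subset of the table, and the constant `'MH'` binds a specific data value that a plain FD could not express.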