While relational databases are still the preferred approach for storing data, XML is emerging
as the primary standard for representing and exchanging data. Consequently, it has
been increasingly important to provide a uniform XML interface to various data sources—
integration; and critical to protect sensitive and confidential information in XML data —
access control. Moreover, it is preferable to first detect and repair the inconsistencies in
the data to avoid the propagation of errors to other data processing steps. In response to
these challenges, this thesis presents an integrated framework for cleaning, integrating and
securing data.
The framework contains three parts. First, the data cleaning sub-framework makes
use of a new class of constraints specially designed for improving data quality, referred
to as conditional functional dependencies (CFDs), to detect and remove inconsistencies in
relational data. Both batch and incremental techniques are developed for detecting CFD
violations by SQL efficiently and repairing them based on a cost model. The cleaned relational
data, together with other non-XML data, is then converted to XML format by using
widely deployed XML publishing facilities. Second, the data integration sub-framework
uses a novel formalism, XML integration grammars (XIGs), to integrate multi-source XML
data which is either native or published from traditional databases. XIGs automatically
support conformance to a target DTD, and allow one to build a large, complex integration
via composition of component XIGs. To efficiently materialize the integrated data, algorithms
are developed for merging XML queries in XIGs and for scheduling them. Third, to
protect sensitive information in the integrated XML data, the data security sub-framework
allows users to access the data only through authorized views. User queries posed on these
views need to be rewritten into equivalent queries on the underlying document to avoid the
prohibitive cost of materializing and maintaining large number of views. Two algorithms
are proposed to support virtual XML views: a rewriting algorithm that characterizes the
rewritten queries as a new form of automata and an evaluation algorithm to execute the
automata-represented queries. They allow the security sub-framework to answer queries
on views in linear time.
Using both relational and XML technologies, this framework provides a uniform approach
to clean, integrate and secure data. The algorithms and techniques in the framework
have been implemented and the experimental study verifies their effectiveness and efficiency