Integrity Constraints Revisited: From Exact to Approximate Implication
Integrity constraints such as functional dependencies (FD) and multi-valued
dependencies (MVD) are fundamental in database schema design. Likewise,
probabilistic conditional independences (CI) are crucial for reasoning about
multivariate probability distributions. The implication problem studies whether
a set of constraints (antecedents) implies another constraint (consequent), and
has been investigated in both the database and the AI literature, under the
assumption that all constraints hold exactly. However, many applications today
consider constraints that hold only approximately. In this paper we define an
approximate implication as a linear inequality between the degree of
satisfaction of the antecedents and consequent, and we study the relaxation
problem: when does an exact implication relax to an approximate implication? We
use information theory to define the degree of satisfaction, and prove several
results. First, we show that any implication from a set of data dependencies
(MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most
quadratic in the number of variables; when the consequent is an FD, the factor
can be reduced to 1. Second, we prove that there exists an implication between
CIs that does not admit any relaxation; however, we prove that every
implication between CIs relaxes "in the limit". Finally, we show that the
implication problem for differential constraints in market basket analysis also
admits a relaxation with a factor equal to 1. Our results recover, and
sometimes extend, several previously known results about the implication
problem: implication of MVDs can be checked by considering only 2-tuple
relations, and the implication of differential constraints for frequent item
sets can be checked by considering only databases containing a single
transaction.
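For intuition, the information-theoretic degree of satisfaction used in this line of work can be computed directly from a relation: an FD X → Y holds exactly iff the empirical conditional entropy H(Y|X) is 0, and positive values measure the degree of violation. A minimal sketch (illustrative attribute names, not code from the paper):

```python
from collections import Counter
from math import log2

def cond_entropy(rows, X, Y):
    """Empirical conditional entropy H(Y|X) over a relation given as a
    list of dicts, with tuples weighted uniformly.  H(Y|X) = 0 iff the
    FD X -> Y holds exactly; larger values mean worse violations."""
    n = len(rows)
    xy = Counter((tuple(r[a] for a in X), tuple(r[a] for a in Y)) for r in rows)
    x = Counter(tuple(r[a] for a in X) for r in rows)
    # Use the identity H(Y|X) = H(XY) - H(X).
    h_xy = -sum(c / n * log2(c / n) for c in xy.values())
    h_x = -sum(c / n * log2(c / n) for c in x.values())
    return h_xy - h_x

# The FD A -> B holds in t1 but is violated in t2:
t1 = [{"A": 1, "B": "x"}, {"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
t2 = [{"A": 1, "B": "x"}, {"A": 1, "B": "y"}]
cond_entropy(t1, ["A"], ["B"])   # 0.0: the FD holds exactly
cond_entropy(t2, ["A"], ["B"])   # 1.0 bit: one A-value maps to two B-values
```

For MVDs, the analogous measure is the conditional mutual information I(Y; Z | X), which is likewise 0 exactly when the dependency holds.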
Distribution Constraints: The Chase for Distributed Data
This paper introduces a declarative framework to specify and reason about distributions of data over computing nodes in a distributed setting. More specifically, it proposes distribution constraints which are tuple and equality generating dependencies (tgds and egds) extended with node variables ranging over computing nodes. In particular, they can express co-partitioning constraints and constraints about range-based data distributions by using comparison atoms. The main technical contribution is the study of the implication problem of distribution constraints. While implication is undecidable in general, relevant fragments of so-called data-full constraints are exhibited for which the corresponding implication problems are complete for EXPTIME, PSPACE and NP. These results yield bounds on deciding parallel-correctness for conjunctive queries in the presence of distribution constraints
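Distribution constraints extend tgds and egds with node variables and comparison atoms; the chase machinery behind such implication tests can be illustrated, for a plain tgd with no node variables, by a single chase round (the tgd R(x, y) → ∃z S(y, z) below is a hypothetical toy example, not from the paper):

```python
from itertools import count

fresh = count()  # supply of labelled nulls

def chase_tgd(instance, body_rel, head_rel):
    """One chase round for the toy tgd  R(x, y) -> exists z . S(y, z).
    `instance` maps relation names to sets of tuples; for each body
    match with no witnessing head tuple, a fresh labelled null is
    invented.  (Distribution constraints additionally carry node
    variables and comparisons, omitted in this sketch.)"""
    added = []
    for (x, y) in sorted(instance.get(body_rel, set())):
        # Violated iff no S-tuple has y in its first position.
        if not any(t[0] == y for t in instance.get(head_rel, set())):
            null = f"_n{next(fresh)}"          # fresh labelled null for z
            instance.setdefault(head_rel, set()).add((y, null))
            added.append((y, null))
    return added

db = {"R": {(1, 2), (3, 2), (1, 4)}}
chase_tgd(db, "R", "S")   # adds S-tuples for y = 2 and y = 4
```

A second round adds nothing, i.e. the chase has terminated; undecidability of implication in the general case corresponds to chases that never terminate.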
Finite Open-World Query Answering with Number Restrictions (Extended Version)
Open-world query answering is the problem of deciding, given a set of facts,
conjunction of constraints, and query, whether the facts and constraints imply
the query. This amounts to reasoning over all instances that include the facts
and satisfy the constraints. We study finite open-world query answering (FQA),
which assumes that the underlying world is finite and thus only considers the
finite completions of the instance. The major known decidable cases of FQA
derive from the following: the guarded fragment of first-order logic, which can
express referential constraints (data in one place points to data in another)
but cannot express number restrictions such as functional dependencies; and the
guarded fragment with number restrictions but on a signature of arity only two.
In this paper, we give the first decidability results for FQA that combine both
referential constraints and number restrictions for arbitrary signatures: we
show that, for unary inclusion dependencies and functional dependencies, the
finiteness assumption of FQA can be lifted up to taking the finite implication
closure of the dependencies. Our result relies on new techniques to construct
finite universal models of such constraints, for any bound on the maximal query
size.
Comment: 59 pages. To appear in LICS 2015. Extended version including proofs.
Integrity Constraints Revisited: From Exact to Approximate Implication
Integrity constraints such as functional dependencies (FD) and multi-valued
dependencies (MVD) are fundamental in database schema design. Likewise,
probabilistic conditional independences (CI) are crucial for reasoning about
multivariate probability distributions. The implication problem studies whether
a set of constraints (antecedents) implies another constraint (consequent), and
has been investigated in both the database and the AI literature, under the
assumption that all constraints hold exactly. However, many applications today
consider constraints that hold only approximately. In this paper we define an
approximate implication as a linear inequality between the degree of
satisfaction of the antecedents and consequent, and we study the relaxation
problem: when does an exact implication relax to an approximate implication? We
use information theory to define the degree of satisfaction, and prove several
results. First, we show that any implication from a set of data dependencies
(MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most
quadratic in the number of variables; when the consequent is an FD, the factor
can be reduced to 1. Second, we prove that there exists an implication between
CIs that does not admit any relaxation; however, we prove that every
implication between CIs relaxes "in the limit". Then, we show that the
implication problem for differential constraints in market basket analysis also
admits a relaxation with a factor equal to 1. Finally, we show how some of the
results in the paper can be derived using the I-measure theory, which relates
between information theoretic measures and set theory. Our results recover, and
sometimes extend, previously known results about the implication problem: the
implication of MVDs and FDs can be checked by considering only 2-tuple
relations.
A Method for Mapping XML DTD to Relational Schemas In The Presence Of Functional Dependencies
The eXtensible Markup Language (XML) has recently emerged as a standard for
data representation and interchange on the web. With so much XML data on the
web, the pressure is now on to manage the data efficiently. Given that
relational databases are the most widely used technology for managing and
storing XML, XML needs to be mapped to relations, and this process is one that
occurs frequently. There are many different ways to map, and many approaches
exist in the literature, especially considering the flexible nesting
structures that XML allows. This gives rise to the following important
problem: are some mappings ‘better’ than others? To approach this problem,
classical relational database design through normalization, a technique based
on the well-known concept of functional dependency, is used as a reference.
This concept is used to specify the constraints that may exist in the
relations and to guide the design while removing semantic data redundancies,
leading to a well-normalized relational schema free of data redundancy. To
achieve such a schema for XML, the concept of functional dependency in
relations needs to be extended to XML and used as guidance for the design.
Although functional dependency definitions for XML exist, these definitions
are not yet standard and still have several limitations. Owing to these
limitations, constraints in the presence of shared and local elements in an
XML document cannot be specified. In this study a new definition of functional
dependency constraints for XML is proposed that is general enough to specify
constraints and to discover semantic redundancies in XML documents.
The focus of this study is on how to produce an optimal mapping approach in
the presence of XML functional dependencies (XFDs), keys and Document Type
Definition (DTD) constraints, as guidance for generating a good relational
schema. To approach the mapping problem, three different components are
explored: the mapping algorithm, functional dependency for XML, and the
implication process. The study of XML implication is important for deriving
the other dependencies that are guaranteed to hold in a relational
representation of XML, given that a set of functional dependencies holds in
the XML document. This leads to the need to derive a set of inference rules
for the implication process. In the presence of DTD and user-defined XFDs,
the other XFDs that are guaranteed to hold in XML can be generated using the
set of inference rules. The mapping algorithm has been developed within a
tool called XtoR. The quality of the mapping approach has been analyzed, and
the results show that the mapping approach (XtoR) significantly improves the
generated relational schema for XML with respect to reducing data and
relation redundancy, removing dangling relations and removing association
problems. The findings suggest that if one wants to use an RDBMS to manage
XML data, the mapping from XML documents to relations must be based on
functional dependency constraints.
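The implication process for XFDs generalizes the classical closure-based test for relational FDs. As a baseline, the standard relational test can be sketched in a few lines (a textbook algorithm, not the thesis's XFD inference system):

```python
def closure(attrs, fds):
    """Attribute closure under a set of FDs, the standard tool for
    deciding FD implication: fds implies X -> Y iff Y is contained in
    closure(X).  `fds` is a list of (lhs, rhs) pairs of attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire every FD whose left-hand side is already derivable.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
closure({"A"}, fds)        # {'A', 'B', 'C'}: so A -> C is implied
```

The XFD inference rules of the thesis play the analogous role for XML, deriving the further XFDs guaranteed to hold given a DTD and user-defined XFDs.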
Extending dependencies for improving data quality
This doctoral thesis presents the results of my work on extending dependencies for
improving data quality, both in a centralized environment with a single database and
in a data exchange and integration environment with multiple databases.
The first part of the thesis proposes five classes of data dependencies, referred to as
CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly
found in practice in a centralized environment. For each class of these dependencies,
we investigate two central problems: the satisfiability problem and the implication
problem. The satisfiability problem is to determine, given a set Σ of
dependencies defined on a database schema R, whether or not there exists a
nonempty database D of R that satisfies Σ. The implication problem is to
determine whether or not a set Σ of dependencies defined on a database schema
R entails another dependency φ on R, that is, whether each database D of R
that satisfies Σ must satisfy φ as well. These are
important for the validation and optimization of data-cleaning processes. We establish
complexity results of the satisfiability problem and the implication problem for all
these five classes of dependencies, both in the absence of finite-domain attributes and in
the general setting with finite-domain attributes. Moreover, SQL-based techniques are
developed to detect data inconsistencies for each class of the proposed dependencies,
which can be easily implemented on top of current database management systems.
The second part of the thesis studies three important topics for data cleaning in a
data exchange and integration environment with multiple databases.
One is the dependency propagation problem, which is to determine, given a view
defined on data sources and a set of dependencies on the sources, whether another
dependency is guaranteed to hold on the view. We investigate dependency propagation
for views defined in various fragments of relational algebra, conditional functional
dependencies (CFDs) [FGJK08] as view dependencies, and for source dependencies
given as either CFDs or traditional functional dependencies (FDs). And we establish
lower and upper bounds, all matching, ranging from PTIME to undecidable. These not
only provide the first results for CFD propagation, but also extend the classical work
of FD propagation by giving new complexity bounds in the presence of
finite-domain attributes. We finally provide the first algorithm for computing a minimal cover of
all CFDs propagated via SPC views. The algorithm has the same complexity as one of
the most efficient algorithms for computing a cover of FDs propagated via a projection
view, despite the increased expressive power of CFDs and SPC views.
Another one is matching records from unreliable data sources. A class of matching
dependencies (MDs) is introduced for specifying the semantics of unreliable data. As
opposed to static constraints for schema design such as FDs, MDs are developed for
record matching, and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs),
to determine what attributes to compare and how to compare them when matching
records across possibly different relations. We also propose a mechanism for inferring MDs with a sound and complete system, a departure from traditional implication
analysis, such that when we cannot match records by comparing attributes that contain
errors, we may still find matches by using other, more reliable attributes. We finally
provide a quadratic time algorithm for inferring MDs, and an effective algorithm for
deducing quality RCKs from a given set of MDs.
The last one is finding certain fixes for data monitoring [CGGM03, SMO07], which
is to find and correct errors in a tuple when it is created, either entered manually or
generated by some process. That is, we want to ensure that a tuple t is clean before it
is used, to prevent errors introduced by adding t. As noted by [SMO07], it is far less
costly to correct a tuple at the point of entry than fixing it afterward.
Data repairing based on integrity constraints may not find certain fixes that are
absolutely correct, and worse, may introduce new errors when repairing the data. We
propose a method for finding certain fixes, based on master data, a notion of certain
regions, and a class of editing rules. A certain region is a set of attributes that are
assured correct by the users. Given a certain region and master data, editing rules tell
us what attributes to fix and how to update them. We show how the method can be used
in data monitoring and enrichment. We develop techniques for reasoning about editing
rules, to decide whether they lead to a unique fix and whether they are able to fix all
the attributes in a tuple, relative to master data and a certain region. We also provide
an algorithm to identify minimal certain regions, such that a certain fix is warranted by
editing rules and master data as long as one of the regions is correct.
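As a toy illustration of the violation checks that the thesis implements as SQL queries, a conditional functional dependency can be verified directly: a pattern tableau row selects the tuples to which the embedded FD applies. This pure-Python stand-in uses hypothetical attribute names and an example CFD, not the thesis's code:

```python
def cfd_violations(rows, lhs, rhs, pattern):
    """Rows violating a conditional FD  lhs -> rhs  whose pattern
    tableau row `pattern` (one value per lhs attribute, '_' = wildcard)
    selects the tuples the embedded FD applies to."""
    selected = [r for r in rows
                if all(p == "_" or r[a] == p for a, p in zip(lhs, pattern))]
    first = {}   # lhs-values -> first rhs-value seen
    bad = []
    for r in selected:
        key = tuple(r[a] for a in lhs)
        if key in first and first[key] != r[rhs]:
            bad.append(r)        # same lhs-values, conflicting rhs-value
        else:
            first.setdefault(key, r[rhs])
    return bad

# Example CFD: when CC = 44, zip determines city.
cust = [{"CC": 44, "zip": "EH8", "city": "Edinburgh"},
        {"CC": 44, "zip": "EH8", "city": "London"},   # violates the CFD
        {"CC": 1,  "zip": "EH8", "city": "NYC"}]      # pattern does not apply
cfd_violations(cust, ["CC", "zip"], "city", (44, "_"))
```

The same pairwise comparison translates into a self-join in SQL, which is why such checks run easily on top of an existing DBMS.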
Clustering Dependencies over Relational Tables
Integrity constraints have proven to be valuable in the database field. Not only can they help schema design (functional dependencies, FDs [1][2]), they can also be used in query optimization (ordering dependencies, ODs [4][5][8][9]), or data cleaning (conditional functional dependencies, CFDs [12] and denial constraints, DCs [14]). In this thesis, however, we will introduce a new type of integrity constraint, called a clustering dependency (CD).
Similar to ordering dependencies, which rely on the database operation ORDER BY, clustering dependencies focus on the operation GROUP BY. Furthermore, we claim that clustering dependencies are useful not only in query optimization, as most integrity constraints are, but also in data visualization, data analysis and MapReduce.
In this thesis, we first introduce some examples of clustering dependencies in a real-life dataset. We then formally define clustering dependencies and elaborate on our motivation. We also look into the reasoning system for clustering dependencies, including the implication problem, the consistency problem and inference rules for clustering dependencies. After that, we propose two algorithms for clustering dependencies: first, a checking algorithm that can verify whether a given dependency is valid in a table within O(N*M) time, with N being the number of rows and M being the size of the potentially aggregated attributes, i.e., the size of the right-hand-side attributes; second, a mining algorithm that discovers all potential clustering dependencies occurring in a table. Finally, we use both synthetic and real-life data to test the performance of our mining algorithm.
On Independence Atoms and Keys
Uniqueness and independence are two fundamental properties of data. Their
enforcement in database systems can lead to higher quality data, faster data
service response time, better data-driven decision making and knowledge
discovery from data. These applications can be unlocked effectively by providing
efficient solutions to the underlying implication problems of keys and
independence atoms. Indeed, for the sole class of keys and the sole class of
independence atoms the associated finite and general implication problems
coincide and enjoy simple axiomatizations. However, the situation changes
drastically when keys and independence atoms are combined. We show that the
finite and the general implication problems are already different for keys and
unary independence atoms. Furthermore, we establish a finite axiomatization for
the general implication problem, and show that the finite implication problem
does not enjoy a k-ary axiomatization for any k.
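Concretely, under the standard reading of an independence atom X ⊥ Y, a relation satisfies it iff every X-value occurring in the relation combines with every occurring Y-value. A checking sketch with hypothetical attribute names (not code from the paper):

```python
from itertools import product

def satisfies_ia(rows, X, Y):
    """Check the independence atom X ⊥ Y on a relation: every occurring
    X-value must co-occur with every occurring Y-value in some tuple."""
    xs = {tuple(r[a] for a in X) for r in rows}
    ys = {tuple(r[a] for a in Y) for r in rows}
    pairs = {(tuple(r[a] for a in X), tuple(r[a] for a in Y)) for r in rows}
    return all((x, y) in pairs for x, y in product(xs, ys))

r1 = [{"A": 0, "B": 0}, {"A": 0, "B": 1},
      {"A": 1, "B": 0}, {"A": 1, "B": 1}]
r2 = [{"A": 0, "B": 0}, {"A": 1, "B": 1}]
satisfies_ia(r1, ["A"], ["B"])   # True: all four combinations occur
satisfies_ia(r2, ["A"], ["B"])   # False: e.g. (A=0, B=1) never occurs
```

A key, by contrast, forbids repeated combinations on its attributes; the tension between these two requirements is what separates the finite and general implication problems when the classes are combined.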