Functional Dependencies Unleashed for Scalable Data Exchange
We address the problem of efficiently evaluating target functional
dependencies (fds) in the Data Exchange (DE) process. Target fds naturally
occur in many DE scenarios, including those in the Life Sciences, in which
multiple source relations need to be structured under a constrained target
schema. However, despite their wide use, evaluating target fds remains a
bottleneck in state-of-the-art DE engines. Systems relying on an all-SQL
approach typically do not support target fds unless additional information is
provided. Alternatively, DE engines that do include these dependencies
typically pay the price of a significant drop in performance and scalability.
In this paper, we present a novel chase-based algorithm that can efficiently
handle arbitrary fds on the target. Our approach essentially relies on
exploiting the interactions between source-to-target (s-t) tuple-generating
dependencies (tgds) and target fds. This allows us to tame the size of the
intermediate chase results through a careful ordering of chase steps that
interleaves fds and (chosen) tgds. As a direct consequence, we significantly
narrow the scope of fd application, often a central cause of the dramatic
overhead induced by target fds. Moreover, reasoning on dependency interaction
further leads us to interesting parallelization opportunities, yielding
additional scalability gains. We provide a proof-of-concept implementation of
our chase-based algorithm and an experimental study aimed at gauging its
scalability with respect to a number of parameters, including the size of the
source instances and the number of dependencies in each tested scenario.
Finally, we empirically compare with the latest DE engines and show that our
algorithm outperforms them.
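For illustration, here is a minimal, hypothetical Python sketch of the ordering idea: applying a target fd immediately after the s-t tgd that can trigger it, rather than in a global pass over the full intermediate instance. All names and data structures are invented for exposition; this is not the paper's engine.

```python
# Minimal, hypothetical sketch of the interleaving idea; not the
# paper's engine. Tuples are dicts, labeled nulls are fresh strings.
from itertools import count

_null_ids = count()

def fresh_null():
    """Invent a labeled null for an unconstrained target position."""
    return f"_N{next(_null_ids)}"

def apply_tgd(source_tuples, make_target_tuple):
    """One s-t tgd step: map each source tuple to a target tuple."""
    return [make_target_tuple(t, fresh_null) for t in source_tuples]

def apply_fd(target_tuples, key_attrs, dep_attr):
    """One target-fd step (key_attrs -> dep_attr): tuples agreeing on
    the key must agree on dep_attr; a labeled null yields to a constant
    (or to another null), a constant/constant clash fails the chase."""
    merged = {}
    for t in target_tuples:
        k = tuple(t[a] for a in key_attrs)
        if k not in merged:
            merged[k] = dict(t)
            continue
        old, new = merged[k][dep_attr], t[dep_attr]
        if str(old).startswith("_N"):
            merged[k][dep_attr] = new          # equate the null
        elif not str(new).startswith("_N") and old != new:
            raise ValueError("fd violation: chase fails")
    return list(merged.values())

# Interleaved ordering: run the fd right after the tgd that can trigger
# it, so each fd pass sees only freshly generated tuples instead of the
# whole intermediate instance.
source = [{"name": "ada", "dept": "bio"}, {"name": "ada", "dept": "bio"}]
target = apply_tgd(source, lambda s, null: {"emp": s["name"],
                                            "dept": s["dept"],
                                            "office": null()})
target = apply_fd(target, key_attrs=("emp",), dep_attr="office")
print(target)   # one tuple: duplicates merged, nulls equated per the fd
```

The point of the interleaving is that each fd application only ever touches the tuples just produced, which is what keeps the intermediate chase result small.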
Describing the complexity of systems: multi-variable "set complexity" and the information basis of systems biology
Context dependence is central to the description of complexity. Keying on the
pairwise definition of "set complexity" we use an information theory approach
to formulate general measures of systems complexity. We examine the properties
of multi-variable dependency starting with the concept of interaction
information. We then present a new measure for unbiased detection of
multi-variable dependency, "differential interaction information." This
quantity for two variables reduces to the pairwise "set complexity" previously
proposed as a context-dependent measure of information in biological systems.
We generalize it here to an arbitrary number of variables. Critical limiting
properties of the "differential interaction information" are key to the
generalization. This measure extends previous ideas about biological
information and provides a more sophisticated basis for study of complexity.
The properties of "differential interaction information" also suggest new
approaches to data analysis. Given a data set of system measurements,
differential interaction information can provide a measure of collective
dependence, which can be represented in hypergraphs describing complex system
interaction patterns. We investigate this kind of analysis using simulated data
sets. The conjoining of a generalized set complexity measure, multi-variable
dependency analysis, and hypergraphs is our central result. While our focus is
on complex biological systems, our results are applicable to any complex
system.
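As a concrete reference point, the classic (McGill) interaction information that the abstract builds on can be estimated from discrete samples as below; the paper's "differential interaction information" itself is not reproduced here, and the sign convention is stated explicitly since it varies across the literature.

```python
# Classic (McGill) interaction information, the multi-variable starting
# point the abstract builds on; the paper's "differential interaction
# information" itself is not reproduced here.
import numpy as np
from collections import Counter

def entropy(samples):
    """Shannon entropy in bits of joint samples given as tuples."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def interaction_information(x, y, z):
    """I(X;Y;Z) = I(X;Y) - I(X;Y|Z)
                = H(X)+H(Y)+H(Z) - H(XY)-H(XZ)-H(YZ) + H(XYZ).
    Note: the opposite sign convention also appears in the literature."""
    H = entropy
    return (H(list(zip(x))) + H(list(zip(y))) + H(list(zip(z)))
            - H(list(zip(x, y))) - H(list(zip(x, z))) - H(list(zip(y, z)))
            + H(list(zip(x, y, z))))

# XOR triple: pairwise independent yet collectively dependent, the
# canonical case where purely pairwise measures see nothing.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 100_000)
y = rng.integers(0, 2, 100_000)
z = x ^ y
print(interaction_information(x, y, z))   # ~ -1 bit: pure synergy
```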
Probabilistic Relational Model Benchmark Generation
The validation of any database mining methodology goes through an evaluation
process in which the availability of benchmarks is essential. In this paper, we
aim to randomly generate relational database benchmarks that allow checking
probabilistic dependencies among the attributes. We are particularly interested
in Probabilistic Relational Models (PRMs), which extend Bayesian Networks (BNs)
to a relational data mining context and enable effective and robust reasoning
over relational data. Even though many works have focused, separately, on the
generation of random Bayesian networks and of random relational databases, no
comparable work exists for PRMs. This paper provides an algorithmic
approach for generating random PRMs from scratch to fill this gap. The proposed
method generates PRMs as well as synthetic relational data from a
randomly generated relational schema and a random set of probabilistic
dependencies. This can be of interest not only for machine learning researchers
to evaluate their proposals in a common framework, but also for database
designers to evaluate the effectiveness of the components of a database
management system.
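A hypothetical sketch of the three-stage pipeline the abstract describes (random schema, random acyclic dependency structure, synthetic data) might look as follows. The names and simplifications are ours, not the authors' algorithm.

```python
# Hypothetical sketch of the three-stage pipeline: random schema,
# random acyclic probabilistic dependencies, synthetic tuples.
# Structure and names are illustrative, not the authors' algorithm.
import random

def random_schema(n_relations=3, max_attrs=4, seed=0):
    """Stage 1: relations, each with a few categorical attributes."""
    rng = random.Random(seed)
    return {f"R{i}": [f"a{j}" for j in range(rng.randint(2, max_attrs))]
            for i in range(n_relations)}

def random_dependencies(schema, p_edge=0.3, seed=0):
    """Stage 2: a random DAG over (relation, attribute) pairs; an edge
    means 'parent attribute probabilistically influences child'."""
    rng = random.Random(seed)
    nodes = [(r, a) for r, attrs in schema.items() for a in attrs]
    return [(u, v) for i, u in enumerate(nodes)
                   for v in nodes[i + 1:]        # fixed order => acyclic
                   if rng.random() < p_edge]

def sample_instance(schema, n_tuples=5, seed=0):
    """Stage 3: synthetic tuples. A real PRM would sample each attribute
    conditioned on its parents' values via its CPD -- elided here."""
    rng = random.Random(seed)
    return {r: [{a: rng.randint(0, 1) for a in attrs}
                for _ in range(n_tuples)]
            for r, attrs in schema.items()}

schema = random_schema()
deps = random_dependencies(schema)
data = sample_instance(schema)
print(schema, f"{len(deps)} dependencies", sep="\n")
```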
Simplification of UML/OCL schemas for efficient reasoning
Ensuring the correctness of a conceptual schema is an essential task in order to avoid the propagation of errors during software development. The kind of reasoning required to perform such a task is known to be exponential for UML class diagrams alone, and even harder when considering OCL constraints. Motivated by this issue, we propose an innovative method aimed at removing constraints and other UML elements from the schema to obtain a simplified one that preserves the same reasoning outcomes. In this way, we can reason about the correctness of the initial artifact by reasoning on a simplified version of it, and the efficiency of the reasoning process is significantly improved. In addition, since our method is independent of the reasoning engine used, any reasoning method may benefit from it.
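As a rough illustration of the shape of such a simplification loop (the concrete, sound removal test is the paper's contribution and is only stubbed here as the hypothetical predicate is_removal_safe):

```python
# Rough illustration of the simplification loop; the sound removal
# test is the paper's contribution and is only stubbed here as the
# hypothetical predicate is_removal_safe.
def simplify(schema_elements, is_removal_safe):
    """Iteratively strip constraints/elements whose removal provably
    cannot change the reasoning outcome, then reason on what remains."""
    kept = list(schema_elements)
    changed = True
    while changed:
        changed = False
        for e in list(kept):
            rest = [x for x in kept if x is not e]
            if e in kept and is_removal_safe(e, rest):
                kept = rest
                changed = True
    return kept

# Toy stand-in for the removal test: pretend only "C2" is irrelevant
# to the (un)satisfiability of the schema.
print(simplify(["C1", "C2", "C3"], lambda e, rest: e == "C2"))
# -> ['C1', 'C3']
```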
The Stochastic complexity of spin models: Are pairwise models really simple?
Models can be simple for different reasons: because they yield a simple and
computationally efficient interpretation of a generic dataset (e.g. in terms of
pairwise dependences), as in statistical learning; or because they capture the
essential ingredients of a specific phenomenon, as in physics, leading to
non-trivial falsifiable predictions. In information theory and Bayesian
inference, the simplicity of a model is precisely quantified by its stochastic
complexity, which measures the number of bits needed to encode its parameters.
In order to understand what simple models look like, we study the
stochastic complexity of spin models with interactions of arbitrary order. We
highlight the existence of invariances with respect to bijections within the
space of operators, which allow us to partition the space of all models into
equivalence classes, in which models share the same complexity. We thus find
that the complexity (or simplicity) of a model is not determined by the order
of the interactions, but rather by their mutual arrangements. Models where
statistical dependencies are localized on non-overlapping groups of few
variables (and that afford predictions on independencies that are easy to
falsify) are simple. On the contrary, fully connected pairwise models, which
are often used in statistical learning, appear to be highly complex, because of
their extended set of interactions.
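For reference, the stochastic complexity invoked here has a standard asymptotic (Rissanen/MDL) form; for a model with k parameters theta, Fisher information J(theta), N observations, and maximized likelihood L-hat:

```latex
% Standard asymptotic stochastic complexity (Rissanen's MDL form);
% \hat{L} is the maximized likelihood, k the number of parameters,
% J(\theta) the Fisher information matrix, N the sample size.
\[
  \mathcal{C} \;=\; -\log \hat{L}
  \;+\; \frac{k}{2}\,\log\frac{N}{2\pi}
  \;+\; \log \int \sqrt{\det J(\theta)}\; d\theta .
\]
```

The last two terms (the parametric complexity) depend on the geometry of the parameter space rather than on k alone, which is consistent with the abstract's finding that the arrangement of interactions, not merely their order, drives a model's complexity.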