
    Functional Dependencies Unleashed for Scalable Data Exchange

    We address the problem of efficiently evaluating target functional dependencies (fds) in the Data Exchange (DE) process. Target fds naturally occur in many DE scenarios, including those in the Life Sciences in which multiple source relations need to be structured under a constrained target schema. Despite their wide use, however, the evaluation of target fds remains a bottleneck in state-of-the-art DE engines. Systems relying on an all-SQL approach typically do not support target fds unless additional information is provided, while DE engines that do include these dependencies typically pay the price of a significant drop in performance and scalability. In this paper, we present a novel chase-based algorithm that can efficiently handle arbitrary fds on the target. Our approach relies on exploiting the interactions between source-to-target (s-t) tuple-generating dependencies (tgds) and target fds. This allows us to tame the size of the intermediate chase results through a careful ordering of chase steps that interleaves fds and (chosen) tgds. As a direct consequence, we significantly narrow the application scope of the fds, often a central cause of the dramatic overhead they induce. Moreover, reasoning on dependency interaction further leads us to interesting parallelization opportunities, yielding additional scalability gains. We provide a proof-of-concept implementation of our chase-based algorithm and an experimental study gauging its scalability with respect to a number of parameters, among which the size of the source instances and the number of dependencies in each tested scenario. Finally, we compare empirically with the latest DE engines and show that our algorithm outperforms them.
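    As a rough illustration of the interleaving idea described above, the following Python sketch applies each target fd eagerly after every tgd step, so labelled nulls are merged early and the intermediate instance stays small. The tuple representation, the "N"-prefix convention for nulls, and the scheduling are hypothetical simplifications, not the paper's actual algorithm.

        # Sketch only: an fd is ((lhs_positions), rhs_position); labelled nulls
        # are modelled as strings starting with "N" (a hypothetical convention).
        def apply_fd(instance, fd):
            lhs, rhs = fd
            merged, seen = [], {}
            for t in instance:
                key = tuple(t[i] for i in lhs)
                if key in seen:
                    prev = seen[key]
                    # Unify rhs values: prefer a constant over a labelled null.
                    if str(prev[rhs]).startswith("N") and not str(t[rhs]).startswith("N"):
                        prev[rhs] = t[rhs]
                else:
                    t = list(t)
                    seen[key] = t
                    merged.append(t)
            return merged

        def chase(source, tgds, fds):
            target = []
            for tgd in tgds:                 # tgd: source tuple -> target tuple
                for s in source:
                    target.append(list(tgd(s)))
                for fd in fds:               # eager fd step tames the instance size
                    target = apply_fd(target, fd)
            return target

        # Example: copy tgd R(a, b) -> T(a, b) with target fd a -> b.
        src = [("k1", "N1"), ("k1", "v"), ("k2", "w")]
        print(chase(src, [lambda s: s], [((0,), 1)]))   # [['k1', 'v'], ['k2', 'w']]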

    Describing the complexity of systems: multi-variable "set complexity" and the information basis of systems biology

    Context dependence is central to the description of complexity. Keying on the pairwise definition of "set complexity", we use an information-theoretic approach to formulate general measures of systems complexity. We examine the properties of multi-variable dependency, starting with the concept of interaction information. We then present a new measure for the unbiased detection of multi-variable dependency, "differential interaction information." For two variables this quantity reduces to the pairwise "set complexity" previously proposed as a context-dependent measure of information in biological systems; we generalize it here to an arbitrary number of variables. Critical limiting properties of the "differential interaction information" are key to the generalization. This measure extends previous ideas about biological information and provides a more sophisticated basis for the study of complexity. The properties of "differential interaction information" also suggest new approaches to data analysis. Given a data set of system measurements, differential interaction information can provide a measure of collective dependence, which can be represented in hypergraphs describing complex system interaction patterns. We investigate this kind of analysis using simulated data sets. The conjoining of a generalized set complexity measure, multi-variable dependency analysis, and hypergraphs is our central result. While our focus is on complex biological systems, our results are applicable to any complex system. Comment: 44 pages, 12 figures; made revisions after peer review.
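    For context, the classical interaction information that the abstract builds on can be written as follows; the paper's "differential interaction information" is a distinct, generalized quantity, so this is background rather than the paper's definition. For three variables,

        I(X;Y;Z) = I(X;Y) - I(X;Y \mid Z),

    and, in one common sign convention, for n variables

        I(X_1;\dots;X_n) = -\sum_{T \subseteq \{X_1,\dots,X_n\}} (-1)^{|T|}\, H(T),

    which reduces to the ordinary mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) for n = 2.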

    Probabilistic Relational Model Benchmark Generation

    The validation of any database mining methodology goes through an evaluation process in which the availability of benchmarks is essential. In this paper, we aim to randomly generate relational database benchmarks that allow checking probabilistic dependencies among the attributes. We are particularly interested in Probabilistic Relational Models (PRMs), which extend Bayesian Networks (BNs) to a relational data mining context and enable effective and robust reasoning over relational data. Even though many works have focused, separately, on the generation of random Bayesian networks and of relational databases, no work has addressed this for PRMs. This paper provides an algorithmic approach for generating random PRMs from scratch to fill this gap. The proposed method generates PRMs, as well as synthetic relational data, from a randomly generated relational schema and a random set of probabilistic dependencies. This can be of interest not only for machine learning researchers, to evaluate their proposals in a common framework, but also for database designers, to evaluate the effectiveness of the components of a database management system.
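    A minimal sketch of the two ingredients the abstract names (a random relational schema plus a random acyclic set of probabilistic dependencies over its attributes) might look like the following; all names and parameters here are hypothetical, and this is not the paper's generation algorithm.

        import random

        def random_schema(n_relations=3, max_attrs=4):
            # One relation per R{i}, each with a random number of attributes.
            return {f"R{i}": [f"R{i}.a{j}" for j in range(random.randint(2, max_attrs))]
                    for i in range(n_relations)}

        def random_dependencies(schema, p_edge=0.3):
            attrs = [a for cols in schema.values() for a in cols]
            random.shuffle(attrs)        # edges follow a random order, so the graph is acyclic
            return [(attrs[i], attrs[j])
                    for i in range(len(attrs)) for j in range(i + 1, len(attrs))
                    if random.random() < p_edge]   # parent -> child dependencies

        schema = random_schema()
        print(schema)
        print(random_dependencies(schema))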

    Simplification of UML/OCL schemas for efficient reasoning

    Ensuring the correctness of a conceptual schema is an essential task in order to avoid the propagation of errors during software development. The kind of reasoning required to perform this task is known to be exponential for UML class diagrams alone and even harder when OCL constraints are considered. Motivated by this issue, we propose an innovative method aimed at removing constraints and other UML elements from the schema to obtain a simplified one that preserves the same reasoning outcomes. In this way, we can reason about the correctness of the initial artifact by reasoning on a simplified version of it, significantly improving the efficiency of the reasoning process. In addition, since our method is independent of the reasoning engine used, any reasoning method may benefit from it.
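    One plausible shape for such a simplification, sketched under the assumption that "preserving reasoning outcomes" amounts to keeping every element the correctness goal transitively references, is the fixpoint slice below; the representation and the relevance criterion are hypothetical, not the paper's method.

        def simplify(schema_elements, goal_elements):
            """schema_elements: name -> set of names it references (constraints,
            classes, associations). goal_elements: names the check depends on."""
            keep = set(goal_elements)
            changed = True
            while changed:               # close 'keep' under references
                changed = False
                for name, refs in schema_elements.items():
                    if name in keep and not refs <= keep:
                        keep |= refs
                        changed = True
            return {n: r for n, r in schema_elements.items() if n in keep}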

    The Stochastic complexity of spin models: Are pairwise models really simple?

    Models can be simple for different reasons: because they yield a simple and computationally efficient interpretation of a generic dataset (e.g. in terms of pairwise dependencies), as in statistical learning, or because they capture the essential ingredients of a specific phenomenon, as in physics, leading to non-trivial falsifiable predictions. In information theory and Bayesian inference, the simplicity of a model is precisely quantified by its stochastic complexity, which measures the number of bits needed to encode its parameters. In order to understand what simple models look like, we study the stochastic complexity of spin models with interactions of arbitrary order. We highlight the existence of invariances with respect to bijections within the space of operators, which allow us to partition the space of all models into equivalence classes in which models share the same complexity. We thus find that the complexity (or simplicity) of a model is determined not by the order of its interactions, but rather by their mutual arrangement. Models in which statistical dependencies are localized on non-overlapping groups of few variables (and which afford predictions on independencies that are easy to falsify) are simple. On the contrary, fully connected pairwise models, which are often used in statistical learning, appear to be highly complex, because of their extended set of interactions.
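    For reference, the stochastic complexity the abstract invokes is, in Rissanen's asymptotic (NML) form for a model class with k parameters and n samples,

        SC(x^n) = -\log p\bigl(x^n \mid \hat{\theta}(x^n)\bigr) + \frac{k}{2}\log\frac{n}{2\pi} + \log \int \sqrt{\det I(\theta)}\, d\theta + o(1),

    where I(\theta) is the Fisher information; the last two terms are the model-complexity penalty that the paper evaluates for spin models of arbitrary interaction order. This standard expansion is given for context only; the paper's exact treatment may differ.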