3,028 research outputs found

    Explain3D: Explaining Disagreements in Disjoint Datasets

    Data plays an important role in applications, analytic processes, and many aspects of human activity. As data grows in size and complexity, we are met with an imperative need for tools that promote understanding and explanations over data-related operations. Data management research on explanations has focused on the assumption that data resides in a single dataset, under one common schema. But the reality of today's data is that it is frequently un-integrated, coming from different sources with different schemas. When different datasets provide different answers to semantically similar questions, understanding the reasons for the discrepancies is challenging and cannot be handled by the existing single-dataset solutions. In this paper, we propose Explain3D, a framework for explaining the disagreements across disjoint datasets (3D). Explain3D focuses on identifying the reasons for the differences in the results of two semantically similar queries operating on two datasets with potentially different schemas. Our framework leverages the queries to perform a semantic mapping across the relevant parts of their provenance; discrepancies in this mapping point to causes of the queries' differences. Exploiting the queries gives Explain3D an edge over traditional schema matching and record linkage techniques, which are query-agnostic. Our work makes the following contributions: (1) We formalize the problem of deriving optimal explanations for the differences of the results of semantically similar queries over disjoint datasets. (2) We design a 3-stage framework for solving the optimal explanation problem. (3) We develop a smart-partitioning optimizer that improves the efficiency of the framework by orders of magnitude. (4) We experiment with real-world and synthetic data to demonstrate that Explain3D can derive precise explanations efficiently.

    The HyperBagGraph DataEdron: An Enriched Browsing Experience of Multimedia Datasets

    Traditional verbatim browsers give back information in a linear way according to a ranking performed by a search engine that may not be optimal for the surfer. The latter may need to assess the pertinence of the information retrieved, particularly when s/he wants to explore other facets of a multi-faceted information space. For instance, in a multimedia dataset different facets such as keywords, authors, publication category, organisations and figures can be of interest. Simultaneous visualisation of the facets can help to gain insights on the information retrieved and call for further searches. Facets are co-occurrence networks, modeled by HyperBag-Graphs -- families of multisets -- and are in fact linked not only to the publication itself, but to any chosen reference. These references make it possible to navigate inside the dataset and perform visual queries. We explore here the case of scientific publications based on arXiv searches. Comment: Extension of the hypergraph framework shortly presented in arXiv:1809.00164 (possible small overlaps); use the theoretical framework of hb-graphs presented in arXiv:1809.0019

    Indexing Temporal XML documents


    Pregelix: Big(ger) Graph Analytics on A Dataflow Engine

    There is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by process-centric, message passing designs that many graph processing systems follow. Pregelix is a new open source distributed graph processing system that is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up to 35x speedup compared to distributed GraphLab), and makes more effective use of available machine resources to support Big(ger) Graph Analytics.

    Towards flexible indices for distributed graph data: The formal schema-level index model FLuID

    Graph indices are key to managing huge amounts of distributed graph data. Instance-level indices are available that focus on the fast retrieval of nodes. Furthermore, there are so-called schema-level indices focusing on summarizing nodes sharing common characteristics, i.e., the combination of attached types and used property-labels. We argue that there is not a one-size-fits-all schema-level index. Rather, a parameterized, formal model is needed that allows one to quickly design, tailor, and compare different schema-level indices. We abstract from related works and provide the formal model FLuID using basic building blocks to flexibly define different schema-level indices. The FLuID model provides parameterized simple and complex schema elements together with four parameters. We show that all indices modeled in FLuID can be computed in O(n). Thus, FLuID enables us to efficiently implement, compare, and validate variants of schema-level indices tailored for specific application scenarios.
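
    As a rough illustration of what a schema-level index summarizes, the sketch below groups nodes by the combination of their attached types and used property-labels. It is not the FLuID model itself: the function name `build_schema_index`, the `rdf:type` marker, and the sample triples are assumptions made for this example only. It makes one pass over the triples and one over the nodes.

```python
from collections import defaultdict

# Assumed convention for this sketch: this predicate attaches a type to a node.
RDF_TYPE = "rdf:type"


def build_schema_index(triples):
    """Group subject nodes by (set of types, set of property labels)."""
    types = defaultdict(set)   # node -> attached types
    props = defaultdict(set)   # node -> used property labels

    # Single pass over the (subject, predicate, object) triples.
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject].add(obj)
        else:
            props[subject].add(predicate)

    # Single pass over the nodes: nodes with the same key share a summary entry.
    index = defaultdict(set)
    for node in set(types) | set(props):
        key = (frozenset(types.get(node, ())), frozenset(props.get(node, ())))
        index[key].add(node)
    return dict(index)


# Hypothetical sample data, not taken from the paper.
triples = [
    ("alice", RDF_TYPE, "Person"),
    ("alice", "name", "Alice"),
    ("bob", RDF_TYPE, "Person"),
    ("bob", "name", "Bob"),
    ("acme", RDF_TYPE, "Organisation"),
    ("acme", "name", "ACME"),
]
index = build_schema_index(triples)
```

    Here `alice` and `bob` share one index entry because both have type `Person` and use the property `name`, while `acme` falls into a separate entry.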