
    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale. Comment: 7 pages.

    An Efficient Built-in Temporal Support in MVCC-based Graph Databases

    Real-world graphs are often dynamic and evolve over time. To trace the evolving properties of graphs, it is necessary to maintain every change of both vertices and edges in graph databases with the support of temporal features. Existing works either maintain all changes in a single graph or periodically materialize snapshots to maintain the historical states of each vertex and edge and process queries over proper snapshots. The former approach suffers from poor query performance because the graph keeps growing over time, while the latter suffers from prohibitively high storage overheads due to large redundant copies of graph data across different snapshots. In this paper, we propose a hybrid data storage engine, based on the MVCC mechanism, that manages current and historical data separately and keeps the current graph as small as possible. In our design, changes in each vertex or edge are stored once. To further reduce the storage overhead, we store only the changes rather than complete snapshots. To boost the query performance, we place a few anchors as snapshots to avoid deep historical version traversals. Based on the storage engine, a temporal query engine is proposed to reconstruct subgraphs as needed on the fly. Therefore, our alternative approach can provide fast querying capabilities over subgraphs at a past time point or range with small storage overheads. To provide native support of temporal features, we integrate our approach into Memgraph, and call the extended database system TGDB (Temporal Graph Database). Extensive experiments are conducted on four real and synthetic datasets. The results show that TGDB outperforms state-of-the-art methods in both storage and performance, and that introducing the temporal features adds almost no performance overhead.
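The idea of storing only changes and placing occasional anchors can be illustrated with a minimal sketch. This is not TGDB's storage engine: the class name `VersionedProps`, the `anchor_every` parameter, and the restriction to one vertex's properties (with no deletions) are all assumptions made for illustration.

```python
from bisect import bisect_right


class VersionedProps:
    """History of one vertex's properties: per-update deltas plus periodic anchors.

    Hypothetical sketch; property deletion is not modelled.
    """

    def __init__(self, anchor_every=3):
        self.anchor_every = anchor_every  # assumed tuning knob: one anchor per N updates
        self.times = []      # commit timestamps, strictly ascending
        self.deltas = []     # deltas[i] holds only the keys changed at times[i]
        self.anchors = {}    # update index -> full copy of the properties at that index
        self.current = {}    # latest state, kept separate from the history

    def update(self, ts, changes):
        if self.times and ts <= self.times[-1]:
            raise ValueError("timestamps must be strictly increasing")
        self.current = {**self.current, **changes}
        self.times.append(ts)
        self.deltas.append(dict(changes))
        i = len(self.times) - 1
        if i % self.anchor_every == 0:
            self.anchors[i] = dict(self.current)

    def as_of(self, ts):
        """Reconstruct the properties visible at time ts."""
        idx = bisect_right(self.times, ts) - 1
        if idx < 0:
            return {}  # nothing had been written yet
        start = idx - (idx % self.anchor_every)  # nearest anchor at or before idx
        state = dict(self.anchors[start])
        for i in range(start + 1, idx + 1):      # replay at most anchor_every - 1 deltas
            state.update(self.deltas[i])
        return state


v = VersionedProps(anchor_every=3)
v.update(1, {"name": "a"})
v.update(2, {"age": 1})
v.update(5, {"age": 2})
v.update(7, {"name": "b"})
print(v.as_of(6))  # {'name': 'a', 'age': 2}
```

A smaller `anchor_every` shortens the replay at the cost of more full copies, which is the trade-off the anchors are meant to tune.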

    Efficient Snapshot Retrieval over Historical Graph Data

    We address the problem of managing historical data for large evolving information networks like social networks or citation networks, with the goal of enabling temporal and evolutionary queries and analysis. We present the design and architecture of a distributed graph database system that stores the entire history of a network and provides support for efficient retrieval of multiple graphs from arbitrary time points in the past, in addition to maintaining the current state for ongoing updates. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compactly recording the historical information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right parameters for a specific scenario. In addition, we present strategies for materializing portions of the historical graph state in memory to further speed up the retrieval process. We also present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.

    Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

    The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics, drawing on techniques from delay-constrained scheduling and the spanning tree literature, to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DATAHUB system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
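The storage-recreation trade-off can be made concrete with a small cost model. The sketch below is not one of the paper's six formulations or heuristics; it assumes that each version is stored either in full or as a delta from one other version, that recreation cost equals the number of bytes read along the delta chain, and that the function name `plan_costs` and the sizes used are hypothetical.

```python
def plan_costs(full_size, delta_size, parent):
    """Return (total storage, worst-case recreation cost) for a storage plan.

    parent[v] is None when version v is stored in full; otherwise it names the
    version that v is stored as a delta from. The plan is assumed to be acyclic.
    """
    recreate = {}

    def recreation(v):
        # Cost to rebuild v: read its base in full, then every delta on the path.
        if v not in recreate:
            p = parent[v]
            if p is None:
                recreate[v] = full_size[v]
            else:
                recreate[v] = recreation(p) + delta_size[(p, v)]
        return recreate[v]

    storage = 0
    for v, p in parent.items():
        storage += full_size[v] if p is None else delta_size[(p, v)]
        recreation(v)
    return storage, max(recreate.values())


# Hypothetical sizes for three versions and the deltas between neighbours.
full_size = {"v1": 100, "v2": 105, "v3": 110}
delta_size = {("v1", "v2"): 10, ("v2", "v3"): 12}

all_full = {"v1": None, "v2": None, "v3": None}
chain = {"v1": None, "v2": "v1", "v3": "v2"}

print(plan_costs(full_size, delta_size, all_full))  # (315, 110)
print(plan_costs(full_size, delta_size, chain))     # (122, 122)
```

Storing everything in full minimizes recreation cost but uses the most space, while a single delta chain does the opposite; plans in between mix the two.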