DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, where teams
collaboratively curate and analyze large datasets. Inspired by software version
control systems like git, we propose (a) a dataset version control system,
giving users the ability to create, branch, merge, difference and search large,
divergent collections of datasets, and (b) a platform, DataHub, that gives
users the ability to perform collaborative data analysis building on this
version control system. We outline the challenges in providing dataset version
control at scale.
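The git-like operations the abstract describes (create, branch, merge, difference) can be sketched over a version DAG of dataset records. This is a minimal illustration under assumed names (VersionGraph, commit, etc.), not DataHub's actual API:

```python
# Minimal sketch of a dataset version DAG with create/branch/diff/merge,
# mirroring the git-like operations described in the abstract.
# All class and method names here are hypothetical.

class VersionGraph:
    def __init__(self):
        self.versions = {}   # version id -> frozenset of records
        self.parents = {}    # version id -> list of parent ids
        self.counter = 0

    def commit(self, records, parents=()):
        # Create a new dataset version with the given parent versions.
        vid = self.counter
        self.counter += 1
        self.versions[vid] = frozenset(records)
        self.parents[vid] = list(parents)
        return vid

    def branch(self, vid):
        # A branch starts as a child version with identical content.
        return self.commit(self.versions[vid], parents=[vid])

    def diff(self, a, b):
        # Records added and removed going from version a to version b.
        va, vb = self.versions[a], self.versions[b]
        return vb - va, va - vb

    def merge(self, a, b):
        # Naive union merge of two divergent versions.
        return self.commit(self.versions[a] | self.versions[b], parents=[a, b])
```

A real system must additionally handle record-level conflicts on merge and avoid storing each version's records in full; the abstract's scale challenges start exactly there.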
An Efficient Built-in Temporal Support in MVCC-based Graph Databases
Real-world graphs are often dynamic and evolve over time. To trace the
evolving properties of graphs, it is necessary to maintain every change of both
vertices and edges in graph databases with the support of temporal features.
Existing works either maintain all changes in a single graph or periodically
materialize snapshots to maintain the historical states of each vertex and edge
and process queries over proper snapshots. The former approach presents poor
query performance due to the ever-growing graph size as time goes by, while the
latter one suffers from prohibitively high storage overheads due to large
redundant copies of graph data across different snapshots. In this paper, we
propose a hybrid data storage engine, which is based on the MVCC mechanism, to
separately manage current and historical data, which keeps the current graph as
small as possible. In our design, changes in each vertex or edge are stored
once. To further reduce the storage overhead, we simply store the changes as
opposed to storing the complete snapshot. To boost the query performance, we
place a few anchors as snapshots to avoid deep historical version traversals.
Based on the storage engine, a temporal query engine is proposed to reconstruct
subgraphs as needed on the fly. Therefore, our alternative approach can provide
fast querying capabilities over subgraphs at a past time point or range with
small storage overheads. To provide native support of temporal features, we
integrate our approach into Memgraph and call the extended database system TGDB (Temporal Graph Database). Extensive experiments on four real and synthetic datasets show that TGDB outperforms state-of-the-art methods in both storage and query performance, with almost no overhead introduced by the temporal features.
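The core idea above (store each change once, and place a few anchor snapshots so point-in-time queries never replay a deep version chain) can be sketched in a few lines. The names below (TemporalStore, state_at) are illustrative, not Memgraph/TGDB internals:

```python
# Sketch of anchor-based temporal reconstruction: changes are stored
# once in a log, and periodic anchor snapshots bound how many deltas
# must be replayed to answer a point-in-time query.
import bisect

class TemporalStore:
    def __init__(self, anchor_every=100):
        self.log = []             # (timestamp, key, value) change records
        self.anchors = []         # (timestamp, snapshot dict), time-ordered
        self.anchor_every = anchor_every
        self.current = {}         # kept small: only the latest state

    def apply(self, ts, key, value):
        self.log.append((ts, key, value))
        self.current[key] = value
        if len(self.log) % self.anchor_every == 0:
            self.anchors.append((ts, dict(self.current)))

    def state_at(self, ts):
        # Start from the latest anchor at or before ts ...
        i = bisect.bisect_right([t for t, _ in self.anchors], ts) - 1
        if i >= 0:
            anchor_ts, snap = self.anchors[i]
            state = dict(snap)
        else:
            anchor_ts, state = -1, {}
        # ... then replay only the deltas in (anchor_ts, ts].
        for t, k, v in self.log:
            if anchor_ts < t <= ts:
                state[k] = v
        return state
```

The anchor spacing is the knob trading storage overhead (more snapshots) against query latency (longer delta replays), which is the hybrid design point the abstract argues for.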
Efficient Snapshot Retrieval over Historical Graph Data
We address the problem of managing historical data for large evolving
information networks like social networks or citation networks, with the goal
to enable temporal and evolutionary queries and analysis. We present the design
and architecture of a distributed graph database system that stores the entire
history of a network and provides support for efficient retrieval of multiple
graphs from arbitrary time points in the past, in addition to maintaining the
current state for ongoing updates. Our system exposes a general programmatic
API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a
novel, extensible, highly tunable, and distributed hierarchical index structure
that enables compactly recording the historical information, and that supports
efficient retrieval of historical graph snapshots for single-site or parallel
processing. Along with the original graph data, DeltaGraph can also maintain
and index auxiliary information; this functionality can be used to extend the
structure to efficiently execute queries like subgraph pattern matching over
historical data. We develop analytical models for both the storage space needed
and the snapshot retrieval times to aid in choosing the right parameters for a
specific scenario. In addition, we present strategies for materializing
portions of the historical graph state in memory to further speed up the
retrieval process. Finally, we present an in-memory graph data structure
called GraphPool that can maintain hundreds of historical graph instances in
main memory in a non-redundant manner. We present a comprehensive experimental
evaluation that illustrates the effectiveness of our proposed techniques at
managing historical graph information.
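The retrieval mechanism the abstract describes, reconstructing a historical snapshot by composing the deltas along a path through a hierarchical index, can be sketched as follows. The structure and names are a simplification for illustration, not DeltaGraph's actual format:

```python
# Sketch of hierarchical delta composition in the spirit of DeltaGraph:
# a snapshot is reconstructed by applying the deltas on the path from a
# materialized root graph down to the requested leaf snapshot.

def apply_delta(graph, delta):
    # A graph is a set of edges; a delta is (added_edges, removed_edges).
    added, removed = delta
    return (graph | added) - removed

def retrieve_snapshot(root_graph, path_deltas):
    # Compose deltas root -> leaf. Because the index is hierarchical,
    # the path is short relative to the full event history, so retrieval
    # touches far fewer deltas than replaying every update.
    g = set(root_graph)
    for delta in path_deltas:
        g = apply_delta(g, delta)
    return g
```

Materializing additional interior nodes of the hierarchy in memory, as the abstract's in-memory strategies do, shortens these paths further at the cost of extra storage.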
Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff
The relative ease of collaborative data science and analysis has led to a
proliferation of many thousands or millions of versions of the same datasets
in many scientific and commercial domains, acquired or constructed at various
stages of data analysis across many users, and often over long periods of time.
Managing, storing, and recreating these dataset versions is a non-trivial task.
The fundamental challenge here is the storage/recreation trade-off: the more
storage we use, the faster it is to recreate or retrieve versions, while the
less storage we use, the slower it is to recreate or retrieve versions. Despite
the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled
manner: we formulate six problems under various settings, trading off these
quantities in various ways, demonstrate that most of the problems are
intractable, and propose a suite of inexpensive heuristics drawing on techniques from the delay-constrained scheduling and spanning-tree literature to solve these problems. We have built a prototype version management system that
aims to serve as a foundation to our DATAHUB system for facilitating
collaborative data science. We demonstrate, via extensive experiments, that our
proposed heuristics provide efficient solutions in practical dataset versioning
scenarios.
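The spanning-tree view underlying the heuristics can be made concrete: versions are nodes, edge weights are the storage costs of deltas, and storing versions along a minimum spanning tree (rooted at one fully materialized version) minimizes total storage while lengthening recreation paths. A toy Prim's algorithm under that framing; not the paper's actual heuristics:

```python
# Toy spanning-tree heuristic for dataset versioning: pick, for each
# version, which other version to store its delta against so that total
# delta storage is minimized (Prim's algorithm over delta costs).
import heapq

def min_storage_tree(n, delta_cost, root=0):
    # delta_cost[u][v]: cost of storing version v as a delta from u.
    in_tree = {root}
    parent = {root: None}   # root is materialized in full
    heap = [(delta_cost[root][v], root, v) for v in range(n) if v != root]
    heapq.heapify(heap)
    while len(in_tree) < n:
        cost, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue
        in_tree.add(v)
        parent[v] = u
        for w in range(n):
            if w not in in_tree:
                heapq.heappush(heap, (delta_cost[v][w], v, w))
    return parent  # parent[v]: version against which v's delta is stored
```

This minimizes storage alone; the paper's harder (and mostly intractable) variants additionally bound recreation cost, which pure MST ignores.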