1,742 research outputs found
Efficient Versioning for Scientific Array Databases
In this paper, we describe a versioned database storage manager we are developing for the SciDB scientific database. The system is designed to efficiently store and retrieve array-oriented data, exposing a "no-overwrite" storage model in which each update creates a new "version" of an array. This makes it possible to perform comparisons of versions produced at different times or by different algorithms, and to create complex chains and trees of versions. We present algorithms to efficiently encode these versions, minimizing storage space while still providing efficient access to the data. Additionally, we present an optimal algorithm that, given a long sequence of versions, determines which versions to encode in terms of each other (using delta compression) to minimize total storage space or query execution cost. We compare the performance of these algorithms on real world data sets from the National Oceanic and Atmospheric Administration (NOAA), Open Street Maps, and several other sources. We show that our algorithms provide better performance than existing version control systems not optimized for array data, both in terms of storage size and access time, and that our delta-compression algorithms are able to substantially reduce the total storage space when versions exist with a high degree of similarity.National Science Foundation (U.S.) (Grant IIS/III-1111371)National Science Foundation (U.S.) (Grant SI2-1047955
DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, where teams
collaboratively curate and analyze large datasets. Inspired by software version
control systems like git, we propose (a) a dataset version control system,
giving users the ability to create, branch, merge, difference and search large,
divergent collections of datasets, and (b) a platform, DataHub, that gives
users the ability to perform collaborative data analysis building on this
version control system. We outline the challenges in providing dataset version
control at scale.Comment: 7 page
A Survey on Array Storage, Query Languages, and Systems
Since scientific investigation is one of the most important providers of
massive amounts of ordered data, there is a renewed interest in array data
processing in the context of Big Data. To the best of our knowledge, a unified
resource that summarizes and analyzes array processing research over its long
existence is currently missing. In this survey, we provide a guide for past,
present, and future research in array processing. The survey is organized along
three main topics. Array storage discusses all the aspects related to array
partitioning into chunks. The identification of a reduced set of array
operators to form the foundation for an array query language is analyzed across
multiple such proposals. Lastly, we survey real systems for array processing.
The result is a thorough survey on array data storage and processing that
should be consulted by anyone interested in this research topic, independent of
experience level. The survey is not complete though. We greatly appreciate
pointers towards any work we might have forgotten to mention.Comment: 44 page
ArrayBridge: Interweaving declarative array processing with high-performance computing
Scientists are increasingly turning to datacenter-scale computers to produce
and analyze massive arrays. Despite decades of database research that extols
the virtues of declarative query processing, scientists still write, debug and
parallelize imperative HPC kernels even for the most mundane queries. This
impedance mismatch has been partly attributed to the cumbersome data loading
process; in response, the database community has proposed in situ mechanisms to
access data in scientific file formats. Scientists, however, desire more than a
passive access method that reads arrays from files.
This paper describes ArrayBridge, a bi-directional array view mechanism for
scientific file formats, that aims to make declarative array manipulations
interoperable with imperative file-centric analyses. Our prototype
implementation of ArrayBridge uses HDF5 as the underlying array storage library
and seamlessly integrates into the SciDB open-source array database system. In
addition to fast querying over external array objects, ArrayBridge produces
arrays in the HDF5 file format just as easily as it can read from it.
ArrayBridge also supports time travel queries from imperative kernels through
the unmodified HDF5 API, and automatically deduplicates between array versions
for space efficiency. Our extensive performance evaluation in NERSC, a
large-scale scientific computing facility, shows that ArrayBridge exhibits
statistically indistinguishable performance and I/O scalability to the native
SciDB storage engine.Comment: 12 pages, 13 figure
Benchmarking SciDB Data Import on HPC Systems
SciDB is a scalable, computational database management system that uses an
array model for data storage. The array data model of SciDB makes it ideally
suited for storing and managing large amounts of imaging data. SciDB is
designed to support advanced analytics in database, thus reducing the need for
extracting data for analysis. It is designed to be massively parallel and can
run on commodity hardware in a high performance computing (HPC) environment. In
this paper, we present the performance of SciDB using simulated image data. The
Dynamic Distributed Dimensional Data Model (D4M) software is used to implement
the benchmark on a cluster running the MIT SuperCloud software stack. A peak
performance of 2.2M database inserts per second was achieved on a single node
of this system. We also show that SciDB and the D4M toolbox provide more
efficient ways to access random sub-volumes of massive datasets compared to the
traditional approaches of reading volumetric data from individual files. This
work describes the D4M and SciDB tools we developed and presents the initial
performance results. This performance was achieved by using parallel inserts, a
in-database merging of arrays as well as supercomputing techniques, such as
distributed arrays and single-program-multiple-data programming.Comment: 5 pages, 4 figures, IEEE High Performance Extreme Computing (HPEC)
2016, best paper finalis
Impliance: A Next Generation Information Management Appliance
ably successful in building a large market and adapting to the changes of the
last three decades, its impact on the broader market of information management
is surprisingly limited. If we were to design an information management system
from scratch, based upon today's requirements and hardware capabilities, would
it look anything like today's database systems?" In this paper, we introduce
Impliance, a next-generation information management system consisting of
hardware and software components integrated to form an easy-to-administer
appliance that can store, retrieve, and analyze all types of structured,
semi-structured, and unstructured information. We first summarize the trends
that will shape information management for the foreseeable future. Those trends
imply three major requirements for Impliance: (1) to be able to store, manage,
and uniformly query all data, not just structured records; (2) to be able to
scale out as the volume of this data grows; and (3) to be simple and robust in
operation. We then describe four key ideas that are uniquely combined in
Impliance to address these requirements, namely the ideas of: (a) integrating
software and off-the-shelf hardware into a generic information appliance; (b)
automatically discovering, organizing, and managing all data - unstructured as
well as structured - in a uniform way; (c) achieving scale-out by exploiting
simple, massive parallel processing, and (d) virtualizing compute and storage
resources to unify, simplify, and streamline the management of Impliance.
Impliance is an ambitious, long-term effort to define simpler, more robust, and
more scalable information systems for tomorrow's enterprises.Comment: This article is published under a Creative Commons License Agreement
(http://creativecommons.org/licenses/by/2.5/.) You may copy, distribute,
display, and perform the work, make derivative works and make commercial use
of the work, but, you must attribute the work to the author and CIDR 2007.
3rd Biennial Conference on Innovative Data Systems Research (CIDR) January
710, 2007, Asilomar, California, US
Decibel: the relational dataset branching system
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these short-comings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.National Science Foundation (U.S.) (1513972)National Science Foundation (U.S.) (1513407)National Science Foundation (U.S.) (1513443)Intel Science and Technology Center for Big Dat
- …