D4M 3.0: Extended Database and Language Capabilities
The D4M tool was developed to address many of today's data needs. This tool
is used by hundreds of researchers to perform complex analytics on unstructured
data. Over the past few years, the D4M toolbox has evolved to support
connectivity with a variety of new database engines, including SciDB.
D4M-Graphulo provides the ability to do graph analytics in the Apache Accumulo
database. Finally, an implementation using the Julia programming language is
also now available. In this article, we describe some of our latest additions
to the D4M toolbox and our upcoming D4M 3.0 release. We show through
benchmarking and scaling results that we can achieve fast SciDB ingest using
the D4M-SciDB connector, that Graphulo can enable graph algorithms at scales that would otherwise be memory limited, and that the Julia implementation of D4M achieves performance comparable to, or exceeding, that of the existing MATLAB(R) implementation.
Comment: IEEE HPEC 201
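To make the associative-array abstraction at the core of D4M concrete, here is a minimal sketch in Julia. It is our illustration only; the `Assoc` type below is hypothetical and is not the D4M or D4M.jl API.

```julia
using SparseArrays

# Minimal associative-array sketch in the spirit of D4M: string row and
# column keys mapped onto a sparse numeric matrix. Illustration only; the
# real D4M / D4M.jl types are far more complete.
struct Assoc
    rowkeys::Vector{String}          # sorted, unique row labels
    colkeys::Vector{String}          # sorted, unique column labels
    A::SparseMatrixCSC{Float64,Int}  # numeric payload
end

function Assoc(r::Vector{String}, c::Vector{String}, v::Vector{Float64})
    rk, ck = sort(unique(r)), sort(unique(c))
    ri = Dict(k => i for (i, k) in enumerate(rk))
    ci = Dict(k => i for (i, k) in enumerate(ck))
    Assoc(rk, ck, sparse([ri[x] for x in r], [ci[x] for x in c], v,
                         length(rk), length(ck)))
end

# Triples (row key, column key, value) become a queryable sparse array.
E = Assoc(["doc1", "doc1", "doc2"], ["word:big", "word:data", "word:big"],
          [1.0, 1.0, 1.0])
coldeg = sum(E.A; dims = 1)  # column degrees, a basic D4M-style analytic
```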
Graphulo Implementation of Server-Side Sparse Matrix Multiply in the Accumulo Database
The Apache Accumulo database excels at distributed storage and indexing and
is ideally suited for storing graph data. Many big data analytics compute on
graph data and persist their results back to the database. These graph
calculations are often best performed inside the database server. The GraphBLAS
standard provides a compact and efficient basis for a wide range of graph
applications through a small number of sparse matrix operations. In this
article, we implement GraphBLAS sparse matrix multiplication server-side by
leveraging Accumulo's native, high-performance iterators. We compare the
mathematics and performance of inner and outer product implementations, and
show how an outer product implementation achieves optimal performance near
Accumulo's peak write rate. We offer our work as a core component of the Graphulo library, which will deliver matrix math primitives for graph analytics within Accumulo.
Comment: To be presented at IEEE HPEC 2015: http://www.ieee-hpec.org
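The inner- versus outer-product distinction the article benchmarks can be made concrete with a small sketch. The Julia code below is a generic illustration of the two formulations, not the Graphulo iterator code: the inner product computes each output entry as a row-column dot product and so probes many empty intersections, while the outer product emits rank-1 partial products that can be summed as a stream, which is what maps well onto Accumulo's write path.

```julia
using LinearAlgebra, SparseArrays

# Inner product: C[i,j] is a dot product of row i of A with column j of B.
function inner_product_mm(A::SparseMatrixCSC, B::SparseMatrixCSC)
    At = copy(transpose(A))            # column access to the rows of A
    m, n = size(A, 1), size(B, 2)
    C = spzeros(m, n)
    for j in 1:n, i in 1:m
        v = dot(At[:, i], B[:, j])
        v != 0 && (C[i, j] = v)        # slow scalar insert; illustrative
    end
    C
end

# Outer product: C = sum over k of A[:,k] * B[k,:]. Each nonzero pair
# yields a partial product that is emitted and summed as it arrives.
function outer_product_mm(A::SparseMatrixCSC, B::SparseMatrixCSC)
    C = spzeros(size(A, 1), size(B, 2))
    Bt = copy(transpose(B))            # column access to the rows of B
    for k in 1:size(A, 2)
        ai, av = findnz(A[:, k])       # nonzeros of column k of A
        bi, bv = findnz(Bt[:, k])      # nonzeros of row k of B
        for (p, x) in zip(ai, av), (q, y) in zip(bi, bv)
            C[p, q] += x * y           # emit and sum a partial product
        end
    end
    C
end

A, B = sprand(200, 200, 0.02), sprand(200, 200, 0.02)
@assert inner_product_mm(A, B) ≈ A * B ≈ outer_product_mm(A, B)
```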
Benchmarking SciDB Data Import on HPC Systems
SciDB is a scalable, computational database management system that uses an
array model for data storage. The array data model of SciDB makes it ideally
suited for storing and managing large amounts of imaging data. SciDB is designed to support in-database advanced analytics, reducing the need to extract data for analysis. It is designed to be massively parallel and can
run on commodity hardware in a high performance computing (HPC) environment. In
this paper, we present the performance of SciDB using simulated image data. The
Dynamic Distributed Dimensional Data Model (D4M) software is used to implement
the benchmark on a cluster running the MIT SuperCloud software stack. A peak
performance of 2.2M database inserts per second was achieved on a single node
of this system. We also show that SciDB and the D4M toolbox provide more
efficient ways to access random sub-volumes of massive datasets compared to the
traditional approaches of reading volumetric data from individual files. This
work describes the D4M and SciDB tools we developed and presents the initial
performance results. This performance was achieved by using parallel inserts, in-database merging of arrays, and supercomputing techniques such as distributed arrays and single-program-multiple-data (SPMD) programming.
Comment: 5 pages, 4 figures, IEEE High Performance Extreme Computing (HPEC) 2016, best paper finalist
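A rough sketch of the SPMD-style parallel ingest pattern follows, in Julia. The `fake_insert_triples` function is a hypothetical placeholder for the actual D4M/SciDB insert call, and the chunking scheme is our own assumption, not the paper's implementation.

```julia
using Distributed
addprocs(4)  # local workers; in practice match the node's core count

# Hypothetical stand-in for the actual D4M/SciDB insert call; @everywhere
# defines it on every worker. Returns the number of triples "inserted".
@everywhere fake_insert_triples(rows, cols, vals) = length(vals)

# SPMD-style parallel ingest: each worker takes a disjoint slice of the
# triples, so the aggregate ingest rate scales with the worker count until
# the database itself becomes the bottleneck.
function parallel_ingest(rows, cols, vals; nchunks = nworkers())
    bounds = round.(Int, range(0, length(vals); length = nchunks + 1))
    counts = pmap(1:nchunks) do c
        lo, hi = bounds[c] + 1, bounds[c + 1]
        fake_insert_triples(rows[lo:hi], cols[lo:hi], vals[lo:hi])
    end
    sum(counts)  # total triples ingested across all workers
end

n = 1_000_000
parallel_ingest(string.(1:n), fill("col", n), rand(n))
```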
Enabling On-Demand Database Computing with MIT SuperCloud Database Management System
The MIT SuperCloud database management system allows for rapid creation and
flexible execution of a variety of the latest scientific databases, including
Apache Accumulo and SciDB. It is designed to permit these databases to run on a
High Performance Computing Cluster (HPCC) platform as seamlessly as any other HPCC job. It ensures the migration of the databases to the resources assigned by the HPCC scheduler and the centralized storage of the database files when not running. It also permits snapshotting of databases so that researchers can experiment and push the limits of the technology without concern for data or productivity loss if a database becomes unstable.
Comment: 6 pages; accepted to IEEE High Performance Extreme Computing (HPEC) conference 2015. arXiv admin note: text overlap with arXiv:1406.492
Big Data Dimensional Analysis
The ability to collect and analyze large amounts of data is a growing challenge within the scientific community. The widening gap between data and users calls for innovative tools that address the challenges posed by big data volume, velocity, and variety. One of the main challenges associated with big data variety is automatically understanding the underlying structures and patterns of the data. Such an understanding is a prerequisite to applying advanced analytics to the data. Further, big data sets often contain anomalies and errors that are difficult to know a priori. Current approaches to understanding data structure are drawn from traditional database ontology design. These approaches are effective, but often require too much human involvement to be practical for the volume, velocity, and variety of data encountered by big data systems. Dimensional Data Analysis (DDA) is a proposed technique that allows big data analysts to quickly understand the overall structure of a big dataset and detect anomalies. DDA exploits structures that exist in a wide class of data to quickly determine the nature of the data and its statistical anomalies. DDA leverages existing schemas that are employed in big data databases today. This paper presents DDA, applies it to a number of data sets, and measures its performance. The overhead of DDA is low, and it can be applied to existing big data systems without greatly impacting their computing requirements.
Comment: From IEEE HPEC 201
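As a loose illustration of the kind of dimensional summary DDA produces, the Julia sketch below computes per-column nonzero counts over a sparse entity-by-feature array and flags columns with extreme counts. The `dimension_summary` function and its outlier threshold are our own assumptions for illustration, not the paper's algorithm.

```julia
using SparseArrays, Statistics

# Dimensional-summary sketch in the spirit of DDA: per-column nonzero
# counts over a sparse entity-by-feature array, with an ad hoc rule to
# flag columns whose counts are extreme outliers. Threshold is ours,
# purely for illustration.
function dimension_summary(A::SparseMatrixCSC, colkeys::Vector{String})
    deg = vec(sum(A .!= 0; dims = 1))   # nonzero count per column
    med = max(median(deg), 1)
    flagged = [colkeys[j] for j in eachindex(deg) if deg[j] > 50 * med]
    (degrees = deg, anomalies = flagged)
end

A  = sprand(10_000, 50, 0.01)
ck = ["feature:$(j)" for j in 1:50]
s  = dimension_summary(A, ck)
```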