Distributed Caching for Processing Raw Arrays
As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, the framework determines an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority of the proposed framework over existing techniques, by as much as two orders of magnitude, in terms of cache overhead and workload execution time.
Distributed Caching for Complex Querying of Raw Arrays
As applications continue to generate multi-dimensional data at exponentially
increasing rates, fast analytics to extract meaningful results is becoming
extremely important. The database community has developed array databases that
alleviate this problem through a series of techniques. In-situ mechanisms
provide direct access to raw data in the original format, without loading and
partitioning. Parallel processing scales to the largest datasets. In-memory
caching reduces latency when the same data are accessed across a workload of
queries. However, we are not aware of any work on distributed caching of
multi-dimensional raw arrays. In this paper, we introduce a distributed
framework for cost-based caching of multi-dimensional arrays in native format.
Given a set of files that contain portions of an array and an online query
workload, the framework computes an effective caching plan in two stages.
First, the plan identifies the cells to be cached locally from each of the
input files by continuously refining an evolving R-tree index. In the second
stage, the framework determines an optimal assignment of cells to nodes that
collocates dependent cells in order to minimize the overall data transfer. We
design cache eviction and placement heuristic algorithms that consider the
historical query workload. A thorough experimental evaluation over two real
datasets in three file formats confirms the superiority of the proposed
framework over existing techniques, by as much as two orders of magnitude, in
terms of cache overhead and workload execution time.
ArrayBridge: Interweaving declarative array processing with high-performance computing
Scientists are increasingly turning to datacenter-scale computers to produce
and analyze massive arrays. Despite decades of database research that extols
the virtues of declarative query processing, scientists still write, debug and
parallelize imperative HPC kernels even for the most mundane queries. This
impedance mismatch has been partly attributed to the cumbersome data loading
process; in response, the database community has proposed in situ mechanisms to
access data in scientific file formats. Scientists, however, desire more than a
passive access method that reads arrays from files.
This paper describes ArrayBridge, a bi-directional array view mechanism for
scientific file formats that aims to make declarative array manipulations
interoperable with imperative file-centric analyses. Our prototype
implementation of ArrayBridge uses HDF5 as the underlying array storage library
and seamlessly integrates into the SciDB open-source array database system. In
addition to fast querying over external array objects, ArrayBridge produces
arrays in the HDF5 file format just as easily as it can read from it.
ArrayBridge also supports time travel queries from imperative kernels through
the unmodified HDF5 API, and automatically deduplicates between array versions
for space efficiency. Our extensive performance evaluation at NERSC, a
large-scale scientific computing facility, shows that the performance and I/O
scalability of ArrayBridge are statistically indistinguishable from those of
the native SciDB storage engine.
Comment: 12 pages, 13 figures
Formal Representation of the SS-DB Benchmark and Experimental Evaluation in EXTASCID
Evaluating the performance of scientific data processing systems is a
difficult task considering the plethora of application-specific solutions
available in this landscape and the lack of a generally-accepted benchmark. The
dual structure of scientific data coupled with the complex nature of processing
complicates the evaluation procedure further. SS-DB is the first attempt to
define a general benchmark for complex scientific processing over raw and
derived data. It has failed to draw sufficient attention, though, because of the
ambiguous plain language specification and the extraordinary SciDB results. In
this paper, we remedy the shortcomings of the original SS-DB specification by
providing a formal representation in terms of ArrayQL algebra operators and
ArrayQL/SciQL constructs. These are the first formal representations of the
SS-DB benchmark. Starting from the formal representation, we give a reference
implementation and present benchmark results in EXTASCID, a novel system for
scientific data processing. EXTASCID is complete in providing native support
both for array and relational data and extensible in executing any user code
inside the system by means of a configurable metaoperator. These features
result in an order of magnitude improvement over SciDB at data loading,
extracting derived data, and operations over derived data.
Comment: 32 pages, 3 figures
Robo-line storage: Low latency, high capacity storage systems over geographically distributed networks
Rapid advances in high performance computing are making possible more complete and accurate computer-based modeling of complex physical phenomena, such as weather front interactions, dynamics of chemical reactions, numerical aerodynamic analysis of airframes, and ocean-land-atmosphere interactions. Many of these 'grand challenge' applications are as demanding of the underlying storage system, in terms of their capacity and bandwidth requirements, as they are of the computational power of the processor. A global view of the Earth's ocean chlorophyll and land vegetation requires over 2 terabytes of raw satellite image data. In this paper, we describe our planned research program in high capacity, high bandwidth storage systems. The project has four overall goals. First, we will examine new methods for high capacity storage systems, made possible by low cost, small form factor magnetic and optical tape systems. Second, access to the storage system will be low latency and high bandwidth. To achieve this, we must interleave data transfer at all levels of the storage system, including devices, controllers, servers, and communications links. Latency will be reduced by extensive caching throughout the storage hierarchy. Third, we will provide effective management of a storage hierarchy, extending the techniques already developed for the Log Structured File System. Finally, we will construct a prototype high capacity file server, suitable for use on the National Research and Education Network (NREN). Such research must be a cornerstone of any coherent program in high performance computing and communications.
The CDF Data Handling System
The Collider Detector at Fermilab (CDF) records proton-antiproton collisions
at a center-of-mass energy of 2.0 TeV at the Tevatron collider. A new collider
run, Run II, of the Tevatron started in April 2001. Increased luminosity will
result in about 1 PB of data recorded on tapes in the next two years. Currently
the CDF experiment has about 260 TB of data stored on tapes. This amount
includes raw and reconstructed data and their derivatives.
The data storage and retrieval are managed by the CDF Data Handling (DH)
system. This system has been designed to accommodate the increased demands of
the Run II environment and has proven robust and reliable in providing a
flow of data from the detector to the end user. This paper gives an overview of
the CDF Run II Data Handling system, which has evolved significantly over the
course of this year. An outline of the future direction of the system is given.
Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, Ca, USA, March 2003, 7 pages, LaTeX, 4 EPS figures, PSN
THKT00