52,163 research outputs found
Mapping Datasets to Object Storage System
Access libraries such as ROOT and HDF5 allow users to interact with datasets
using high level abstractions, like coordinate systems and associated slicing
operations. Unfortunately, the implementations of access libraries are based on
outdated assumptions about storage systems interfaces and are generally unable
to fully benefit from modern fast storage devices. The situation is getting
worse with rapidly evolving storage devices such as non-volatile memory and
ever larger datasets. This project explores distributed dataset mapping
infrastructures that can integrate and scale out existing access libraries
using Ceph's extensible object model, avoiding re-implementation or even
modifications of these access libraries as much as possible. These programmable
storage extensions coupled with our distributed dataset mapping techniques
enable: 1) access library operations to be offloaded to storage system servers,
2) the independent evolution of access libraries and storage systems and 3)
fully leveraging of the existing load balancing, elasticity, and failure
management of distributed storage systems like Ceph. They also create more
opportunities to conduct storage server-local optimizations specific to storage
servers. For example, storage servers might include local key/value stores
combined with chunk stores that require different optimizations than a local
file system. As storage servers evolve to support new storage devices like
non-volatile memory, these server-local optimizations can be implemented while
minimizing disruptions to applications. We will report progress on the means by
which distributed dataset mapping can be abstracted over particular access
libraries, including access libraries for ROOT data, and how we address some of
the challenges revolving around data partitioning and composability of access
operations
ArrayBridge: Interweaving declarative array processing with high-performance computing
Scientists are increasingly turning to datacenter-scale computers to produce
and analyze massive arrays. Despite decades of database research that extols
the virtues of declarative query processing, scientists still write, debug and
parallelize imperative HPC kernels even for the most mundane queries. This
impedance mismatch has been partly attributed to the cumbersome data loading
process; in response, the database community has proposed in situ mechanisms to
access data in scientific file formats. Scientists, however, desire more than a
passive access method that reads arrays from files.
This paper describes ArrayBridge, a bi-directional array view mechanism for
scientific file formats, that aims to make declarative array manipulations
interoperable with imperative file-centric analyses. Our prototype
implementation of ArrayBridge uses HDF5 as the underlying array storage library
and seamlessly integrates into the SciDB open-source array database system. In
addition to fast querying over external array objects, ArrayBridge produces
arrays in the HDF5 file format just as easily as it can read from it.
ArrayBridge also supports time travel queries from imperative kernels through
the unmodified HDF5 API, and automatically deduplicates between array versions
for space efficiency. Our extensive performance evaluation in NERSC, a
large-scale scientific computing facility, shows that ArrayBridge exhibits
statistically indistinguishable performance and I/O scalability to the native
SciDB storage engine.Comment: 12 pages, 13 figure
The CDF Data Handling System
The Collider Detector at Fermilab (CDF) records proton-antiproton collisions
at center of mass energy of 2.0 TeV at the Tevatron collider. A new collider
run, Run II, of the Tevatron started in April 2001. Increased luminosity will
result in about 1~PB of data recorded on tapes in the next two years. Currently
the CDF experiment has about 260 TB of data stored on tapes. This amount
includes raw and reconstructed data and their derivatives.
The data storage and retrieval are managed by the CDF Data Handling (DH)
system. This system has been designed to accommodate the increased demands of
the Run II environment and has proven robust and reliable in providing reliable
flow of data from the detector to the end user. This paper gives an overview of
the CDF Run II Data Handling system which has evolved significantly over the
course of this year. An outline of the future direction of the system is given.Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, Ca, USA, March 2003, 7 pages, LaTeX, 4 EPS figures, PSN
THKT00
Rumble: Data Independence for Large Messy Data Sets
This paper introduces Rumble, an engine that executes JSONiq queries on
large, heterogeneous and nested collections of JSON objects, leveraging the
parallel capabilities of Spark so as to provide a high degree of data
independence. The design is based on two key insights: (i) how to map JSONiq
expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR
clauses to Spark SQL on DataFrames. We have developed a working implementation
of these mappings showing that JSONiq can efficiently run on Spark to query
billions of objects into, at least, the TB range. The JSONiq code is concise in
comparison to Spark's host languages while seamlessly supporting the nested,
heterogeneous data sets that Spark SQL does not. The ability to process this
kind of input, commonly found, is paramount for data cleaning and curation. The
experimental analysis indicates that there is no excessive performance loss,
occasionally even a gain, over Spark SQL for structured data, and a performance
gain over PySpark. This demonstrates that a language such as JSONiq is a simple
and viable approach to large-scale querying of denormalized, heterogeneous,
arborescent data sets, in the same way as SQL can be leveraged for structured
data sets. The results also illustrate that Codd's concept of data independence
makes as much sense for heterogeneous, nested data sets as it does on highly
structured tables.Comment: Preprint, 9 page
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. Finally, we hope that the proposed taxonomy and
mapping also helps to provide an easy way for new practitioners to understand
this complex area of research.Comment: 46 pages, 16 figures, Technical Repor
- …