24,155 research outputs found
Robo-line storage: Low latency, high capacity storage systems over geographically distributed networks
Rapid advances in high performance computing are making possible more complete and accurate computer-based modeling of complex physical phenomena, such as weather front interactions, dynamics of chemical reactions, numerical aerodynamic analysis of airframes, and ocean-land-atmosphere interactions. Many of these 'grand challenge' applications are as demanding of the underlying storage system, in terms of their capacity and bandwidth requirements, as they are on the computational power of the processor. A global view of the Earth's ocean chlorophyll and land vegetation requires over 2 terabytes of raw satellite image data. In this paper, we describe our planned research program in high capacity, high bandwidth storage systems. The project has four overall goals. First, we will examine new methods for high capacity storage systems, made possible by low cost, small form factor magnetic and optical tape systems. Second, access to the storage system will be low latency and high bandwidth. To achieve this, we must interleave data transfer at all levels of the storage system, including devices, controllers, servers, and communications links. Latency will be reduced by extensive caching throughout the storage hierarchy. Third, we will provide effective management of a storage hierarchy, extending the techniques already developed for the Log Structured File System. Finally, we will construct a protototype high capacity file server, suitable for use on the National Research and Education Network (NREN). Such research must be a Cornerstone of any coherent program in high performance computing and communications
Optimal modeling for complex system design
The article begins with a brief introduction to the theory describing optimal data compression systems and their performance. A brief outline is then given of a representative algorithm that employs these lessons for optimal data compression system design. The implications of rate-distortion theory for practical data compression system design is then described, followed by a description of the tensions between theoretical optimality and system practicality and a discussion of common tools used in current algorithms to resolve these tensions. Next, the generalization of rate-distortion principles to the design of optimal collections of models is presented. The discussion focuses initially on data compression systems, but later widens to describe how rate-distortion theory principles generalize to model design for a wide variety of modeling applications. The article ends with a discussion of the performance benefits to be achieved using the multiple-model design algorithms
Parallel Implementation of Lossy Data Compression for Temporal Data Sets
Many scientific data sets contain temporal dimensions. These are the data
storing information at the same spatial location but different time stamps.
Some of the biggest temporal datasets are produced by parallel computing
applications such as simulations of climate change and fluid dynamics. Temporal
datasets can be very large and cost a huge amount of time to transfer among
storage locations. Using data compression techniques, files can be transferred
faster and save storage space. NUMARCK is a lossy data compression algorithm
for temporal data sets that can learn emerging distributions of element-wise
change ratios along the temporal dimension and encodes them into an index table
to be concisely represented. This paper presents a parallel implementation of
NUMARCK. Evaluated with six data sets obtained from climate and astrophysics
simulations, parallel NUMARCK achieved scalable speedups of up to 8788 when
running 12800 MPI processes on a parallel computer. We also compare the
compression ratios against two lossy data compression algorithms, ISABELA and
ZFP. The results show that NUMARCK achieved higher compression ratio than
ISABELA and ZFP.Comment: 10 pages, HiPC 201
Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP
With ever-increasing volumes of scientific data produced by HPC applications,
significantly reducing data size is critical because of limited capacity of
storage space and potential bottlenecks on I/O or networks in writing/reading
or transferring data. SZ and ZFP are the two leading lossy compressors
available to compress scientific data sets. However, their performance is not
consistent across different data sets and across different fields of some data
sets: for some fields SZ provides better compression performance, while other
fields are better compressed with ZFP. This situation raises the need for an
automatic online (during compression) selection between SZ and ZFP, with a
minimal overhead. In this paper, the automatic selection optimizes the
rate-distortion, an important statistical quality metric based on the
signal-to-noise ratio. To optimize for rate-distortion, we investigate the
principles of SZ and ZFP. We then propose an efficient online, low-overhead
selection algorithm that predicts the compression quality accurately for two
compressors in early processing stages and selects the best-fit compressor for
each data field. We implement the selection algorithm into an open-source
library, and we evaluate the effectiveness of our proposed solution against
plain SZ and ZFP in a parallel environment with 1,024 cores. Evaluation results
on three data sets representing about 100 fields show that our selection
algorithm improves the compression ratio up to 70% with the same level of data
distortion because of very accurate selection (around 99%) of the best-fit
compressor, with little overhead (less than 7% in the experiments).Comment: 14 pages, 9 figures, first revisio
- …