28 research outputs found
Analyzing data properties using statistical sampling techniques – illustrated on scientific file formats and compression features
Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure for these workloads. For example, knowing the relevance of file formats allows optimizing the relevant formats and also helps define procurement benchmarks that cover these formats. Existing studies that investigate performance improvements and techniques for data reduction such as deduplication and compression operate on a small set of data. Some of those studies claim the selected data is representative and scale their result to the scale of the data center. One hurdle of running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many of those experiments must be justified.
This paper investigates stochastic sampling methods to compute and analyze quantities of interest on file numbers and also on the occupied storage space. We demonstrate that, on our production system, scanning 1% of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and reduces the costs of such studies significantly. The contributions of this paper are: (1) the systematic investigation of the inherent analysis error when operating only on a subset of data, (2) the demonstration of methods that help future studies to mitigate this error, and (3) the illustration of the approach on a study of scientific file types and compression for a data center.
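The sampling idea in this abstract can be illustrated with a minimal sketch. It assumes a simple random sample of files and a normal-approximation confidence interval; the function name `estimate_fraction`, the synthetic file list, and the `.nc` suffix check are hypothetical and are not taken from the paper.

```python
import math
import random


def estimate_fraction(population, predicate, sample_fraction=0.01, z=1.96, seed=0):
    """Estimate the share of items satisfying `predicate` from a simple random sample.

    Returns (estimate, half_width), where half_width is the half width of a
    normal-approximation confidence interval (z=1.96 is roughly 95 %).
    The finite-population correction is ignored in this sketch.
    """
    rng = random.Random(seed)
    n = max(1, int(len(population) * sample_fraction))
    sample = rng.sample(population, n)  # sampling without replacement
    p_hat = sum(1 for item in sample if predicate(item)) / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, half_width


# Synthetic stand-in for a file system scan: 30 % of the names end in ".nc".
files = [f"file{i}.nc" if i % 10 < 3 else f"file{i}.dat" for i in range(100_000)]
estimate, half_width = estimate_fraction(files, lambda name: name.endswith(".nc"))
print(f"estimated share of .nc files: {estimate:.3f} +/- {half_width:.3f}")
```

This estimates a quantity by file count only. For quantities weighted by occupied storage space, one option is to draw files with probability proportional to their size; that variant is not shown here.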
Analyzing data properties using statistical sampling: illustrated on scientific file formats
Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure for these workloads. For example, knowing the relevance of file formats allows optimizing the relevant formats and also helps define procurement benchmarks that cover these formats. Existing studies that investigate performance improvements and techniques for data reduction such as deduplication and compression operate on a subset of data. Some of those studies claim the selected data is representative and scale their result to the scale of the data center. One hurdle of running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many of those experiments must be justified.
This paper investigates stochastic sampling methods to compute and analyze quantities of interest on file numbers and also on the occupied storage space. We demonstrate that, on our production system, scanning 1% of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and reduces the costs of such studies significantly.
Potential of I/O aware workflows in climate and weather
The efficient, convenient, and robust execution of data-driven workflows and enhanced data management are essential for productivity in scientific computing. In HPC, the concerns of storage and computing are traditionally separated and optimised independently from each other and the needs of the end-to-end user. However, in complex workflows, this is becoming problematic. These problems are particularly acute in climate and weather workflows, which, as well as becoming increasingly complex and exploiting deep storage hierarchies, can involve multiple data centres.
The key contributions of this paper are: 1) a sketch of a vision for an integrated data-driven approach, with a discussion of the associated challenges and implications, and 2) an architecture and roadmap consistent with this vision that would allow a seamless integration into current climate and weather workflows, as it utilises versions of existing tools (ESDM, Cylc, XIOS, and DDN’s IME).
The vision proposed here is built on the belief that workflows composed of data, computing, and communication-intensive tasks should drive interfaces and hardware configurations to better support the programming models. When delivered, this work will increase the opportunity for smarter scheduling of computing by considering storage in heterogeneous storage systems. We illustrate the performance impact on an example workload using a model built on measured performance data using ESDM at DKRZ.
Predicting I/O performance in HPC using artificial neural networks
The prediction of file access times is an important part of modeling supercomputers' storage systems. These models can be used to develop analysis tools that help users achieve efficient I/O behavior.
In this paper, we analyze and predict the access times of a Lustre file system from the client perspective. To this end, we measured file access times in various test series and developed different models for predicting access times. The evaluation shows that in models utilizing artificial neural networks the average prediction error is about 30% smaller than in linear models. A phenomenon in the distribution of file access times is of particular interest: file accesses with identical parameters show several typical access times. The typical access times usually differ by orders of magnitude and can be explained by a different processing of the file accesses in the storage system: an alternative I/O path. We investigate a method to automatically determine the alternative I/O path and quantify the significance of knowledge about the internal processing. It is shown that the prediction error is reduced significantly with this approach.
Data compression for climate data
The different rates of increase for computational power and storage capabilities of supercomputers turn data storage into a technical and economical problem. Because storage capabilities are lagging behind, investments and operational costs for storage systems have increased to keep up with the supercomputers' I/O requirements. One promising approach is to reduce the amount of data that is stored. In this paper, we examine the impact of compression on the performance and costs of high performance systems. To this end, we analyze the applicability of compression on all layers of the I/O stack, that is, main memory, network and storage. Based on the Mistral system of the German Climate Computing Center (Deutsches Klimarechenzentrum, DKRZ), we illustrate potential performance improvements and cost savings. Making use of compression on a large scale can decrease investments and operational costs by 50% without negatively impacting performance. Additionally, we present ongoing work on supporting enhanced adaptive compression in the parallel distributed file system Lustre and application-specific compression.
I/O performance evaluation with Parabench — programmable I/O benchmark
Choosing an appropriate cluster file system for a specific high performance computing application is challenging and depends mainly on the specific application I/O needs. There is a wide variety of I/O requirements: some implementations require reading and writing large datasets, others out-of-core data access, or they have database access requirements. Application access patterns reflect different I/O behavior and can be used for performance testing. This paper presents the programmable I/O benchmarking tool Parabench. It takes access patterns as input, which can be adapted to mimic the behavior of a rich set of applications. Using this benchmarking tool, composed patterns can be automatically tested and easily compared on different local and cluster file systems. Here we introduce the design of the proposed benchmark, focusing on the Parabench programming language, which was developed for flexible pattern creation. We also demonstrate an exemplary usage of Parabench and its capabilities to handle the POSIX and MPI-IO interfaces.
Interference of billing and scheduling strategies for energy and cost savings in modern data centers
The high energy consumption of HPC systems is an obstacle for ever-growing systems. Unfortunately, energy consumption does not decrease linearly with reduced workload; therefore, energy conservation techniques have been deployed on various levels which steer the overall system. While the overall saving of energy is useful, the price of energy is not necessarily proportional to the consumption. Particularly with renewable energies, there are occasions in which the price is significantly lower. The potential for saving energy costs by using smart contracts with energy providers has received little research attention. In this paper, we conduct an analysis of the potential savings when applying cost-aware schedulers to data center workloads while considering power contracts that allow for dynamic (hourly) pricing. The contributions of this paper are twofold: 1) the theoretical assessment of cost savings; 2) the development of a simulator to replay batch scheduler traces which supports flexible energy cost models and various cost-aware scheduling algorithms. This makes it possible to approximate the energy cost savings of data centers for various scenarios including off-peak and hourly budgeted energy prices as provided by the energy spot market. An evaluation is conducted with four annual job traces from the German Climate Computing Center (DKRZ) and Leibniz Supercomputing Centre (LRZ).
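As a toy illustration of cost-aware scheduling under hourly pricing, the sketch below picks the cheapest start hour for a single job. It is not the simulator described in the abstract: the functions `job_cost` and `cheapest_start`, the constant-power assumption, and the price values are all hypothetical.

```python
def job_cost(prices, start, duration, power_kw):
    """Energy cost of a job drawing constant power for `duration` whole hours."""
    return power_kw * sum(prices[start:start + duration])


def cheapest_start(prices, duration, power_kw, earliest=0, latest=None):
    """Return (start_hour, cost) of the cheapest start hour in [earliest, latest]."""
    if latest is None:
        latest = len(prices) - duration  # last start that still fits the price list
    candidates = range(earliest, latest + 1)
    best = min(candidates, key=lambda s: job_cost(prices, s, duration, power_kw))
    return best, job_cost(prices, best, duration, power_kw)


# Hypothetical hourly prices in EUR/kWh for one day (cheap at night, expensive in the evening).
prices = [0.10] * 6 + [0.20] * 12 + [0.30] * 4 + [0.10] * 2
start, cost = cheapest_start(prices, duration=4, power_kw=50.0)
immediate_cost = job_cost(prices, 8, 4, 50.0)  # same job started immediately at hour 8
print(f"cheapest start hour: {start}, cost: {cost:.2f}, immediate start cost: {immediate_cost:.2f}")
```

A real scheduler would also have to respect node availability, queueing order, and deadlines, which this single-job sketch leaves out.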
A similarity study of I/O traces via string kernels
Understanding I/O for data-intensive applications is the foundation for the optimization of these applications. The classification of the applications according to the expressed I/O access pattern eases the analysis. An access pattern can be seen as the fingerprint of an application. In this paper, we address the classification of traces. First, we convert them into a weighted string representation. Because string objects can be easily compared using kernel methods, we explore their use for fingerprinting I/O patterns. To improve accuracy, we propose a novel string kernel function called the kast2 spectrum kernel. The similarity matrices, obtained after applying this kernel to a set of examples from a real application, were analyzed using kernel principal component analysis and hierarchical clustering. The evaluation showed that two out of four I/O access pattern groups were completely identified, while the other two groups formed a single cluster due to the intrinsic similarity of their members. The proposed strategy shows promise for other similarity problems involving tree-like structured data.
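To make the string-kernel idea concrete, the following sketch computes a similarity matrix with the standard k-spectrum kernel. This is not the kast2 kernel proposed in the paper, and it ignores the weights of the weighted string representation; the encoding of traces as strings of `R`, `W`, and `S` characters is a hypothetical example.

```python
from collections import Counter
import math


def spectrum_features(s, k):
    """Count every contiguous substring of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of a and b."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    return sum(count * fb[kmer] for kmer, count in fa.items())


def normalized_kernel(a, b, k=3):
    """Cosine-normalised kernel value; 0.0 if either string is shorter than k."""
    denom = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# Hypothetical traces encoded as strings: R = read, W = write, S = seek.
traces = ["RRRRWWWW", "RRRWWWWW", "SRSRSRSR"]
similarity = [[normalized_kernel(a, b) for b in traces] for a in traces]
for row in similarity:
    print(" ".join(f"{value:.2f}" for value in row))
```

A matrix of this form is the kind of input that kernel principal component analysis or hierarchical clustering would then operate on.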
Comparison of Clang Abstract Syntax Trees using string kernels
Abstract Syntax Trees (ASTs) are intermediate representations widely used by compiler frameworks. One of their strengths is that they can be used to determine the similarity among a collection of programs. In this paper we propose a novel comparison method that converts ASTs into weighted strings in order to obtain similarity matrices and quantify the level of correlation among codes. To evaluate the approach, we used the strings derived from the Clang ASTs of a set of 100 source code examples written in C. Our kernel and two other string kernels from the literature were used to obtain similarity matrices among those examples. Next, we used hierarchical clustering to visualize the results. Our solution was able to identify different clusters formed by examples that shared similar semantics. We demonstrated that the proposed strategy shows promise for similarity problems involving trees or strings.
Monitoring energy consumption with SIOX
In the face of the growing complexity of HPC systems, their growing energy costs, and the increasing difficulty of running applications efficiently, a number of monitoring tools have been developed in recent years. SIOX is one such endeavor, with a uniquely holistic approach: it aims not only to record a certain kind of data, but to make all relevant data available for analysis and optimization. Among other sources, this encompasses data from hardware energy counters and trace data from different hardware/software layers. However, not all data that can be recorded should be recorded. As such, SIOX needs good heuristics to determine when and what data needs to be collected, and the energy consumption can provide an important signal about when the system is in a state that deserves closer attention. In this paper, we show that SIOX can use Likwid to collect and report the energy consumption of applications, and present how this data can be visualized using SIOX’s web interface. Furthermore, we outline how SIOX can use this information to intelligently adjust the amount of data it collects, allowing it to reduce the monitoring overhead while still providing complete information about critical situations.