An Empirical Implementation of an I/O Separation Scheme for Burst Buffers in High-Performance Computing Systems
Master's thesis, Seoul National University Graduate School, Department of Computer Science and Engineering, College of Engineering, August 2019. Advisor: Hyeonsang Eom. To meet the exascale I/O requirements of High-Performance Computing (HPC), a new I/O subsystem named Burst Buffer, based on non-volatile memory, has been developed. However, the diverse HPC workloads and the bursty I/O pattern cause severe data fragmentation on SSDs, which creates the need for expensive garbage collection (GC) and also increases the number of bytes actually written to the SSD. The new multi-stream feature in SSDs offers an option to reduce the cost of garbage collection. In this paper, we leverage this multi-stream feature to group I/O streams based on user IDs and implement this strategy in a burst buffer we call BIOS, short for Burst Buffer with an I/O Separation scheme. Furthermore, to optimize the I/O separation scheme in burst buffer environments, we propose a stream-aware scheduling policy based on burst buffer pools in the workload manager and implement a real burst buffer system, the BIOS framework, by integrating BIOS with the workload manager. We evaluate BIOS and the framework with burst buffer I/O traces from the Cori supercomputer covering a diverse set of applications. We also disclose and analyze the benefits and limitations of using an I/O separation scheme in HPC systems. Experimental results show that BIOS improves performance by 1.44× on average and reduces the Write Amplification Factor (WAF) by up to 1.20×, and that the framework preserves the benefits of the I/O separation scheme in the HPC environment.
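The user-ID-based stream allocation described above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the BIOS implementation: it assumes a fixed number of SSD streams (the constant and names are placeholders) and simply maps each user's writes to a stream derived from the order in which users appear, sharing streams when there are more active users than streams.

```python
# Minimal sketch of user-ID-based I/O stream allocation (illustrative only;
# NUM_STREAMS and all names are assumptions, not the BIOS implementation).

NUM_STREAMS = 8  # multi-streamed SSDs expose a small, fixed number of streams


def allocate_stream(user_id: int, stream_table: dict) -> int:
    """Return the SSD stream assigned to this user, creating a mapping if needed.

    Writes from the same user are grouped into the same stream so that data
    with similar lifetimes is placed together, reducing garbage-collection cost.
    When there are more users than streams, users share streams round-robin.
    """
    if user_id not in stream_table:
        stream_table[user_id] = len(stream_table) % NUM_STREAMS
    return stream_table[user_id]


if __name__ == "__main__":
    table = {}
    for uid in [1001, 1002, 1001, 1003, 1010]:
        print(f"user {uid} -> stream {allocate_stream(uid, table)}")
```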
Abstract
Introduction
Background and Challenges
Burst Buffer
Write Amplification in SSDs
Multi-streamed SSD
Challenges of Multi-stream Feature in Burst Buffers
I/O Separation Scheme in Burst Buffer
Stream Allocation Criteria
Implementation
Limitations of User ID-based Stream Allocation
BIOS Framework
Support in Workload Manager
Burst Buffer Pools
Stream-Aware Scheduling Policy
Workflow of BIOS Framework
Evaluation
Experiment Setup
Evaluation with Synthetic Workload
Evaluation with HPC Applications
Evaluation with Emulated Workload
Evaluation with Different Striping Configuration
Evaluation on BIOS Framework
Summary and Lessons Learned
An I/O Separation Scheme in Burst Buffer
Evaluation with Synthetic Workload
Evaluation with HPC Applications
Evaluation with Emulated Workload
Evaluation with Striping Configurations
A BIOS Framework
Evaluation with Real Burst Buffer Environments
Discussion
Limited Number of Nodes
Advanced BIOS Framework
Related work
Conclusions
Bibliography
Abstract (in Korean)
Characterizing Deep-Learning I/O Workloads in TensorFlow
The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during training, a considerably large number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerators for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. Using a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to the slower storage on our benchmark environment.
Comment: Accepted for publication at PDSW-DISCS 201
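The burst-buffer checkpointing idea described above can be sketched in a few lines: write the checkpoint synchronously to a fast, small tier, then copy it to the slower, large tier in the background so training is not blocked. The snippet below is an illustrative sketch under assumed mount paths and function names, not the paper's implementation.

```python
# Minimal sketch of burst-buffer-style checkpointing: write the checkpoint to a
# fast, small-capacity tier, then copy it to slower, large-capacity storage in
# the background so training is not blocked. Paths and names are illustrative
# assumptions, not the paper's implementation.
import shutil
import threading
from pathlib import Path

FAST_TIER = Path("/mnt/burst_buffer/checkpoints")   # e.g. node-local NVMe
SLOW_TIER = Path("/mnt/parallel_fs/checkpoints")    # e.g. Lustre/GPFS


def checkpoint(step: int, data: bytes) -> threading.Thread:
    """Write the checkpoint to the fast tier, then drain it asynchronously."""
    FAST_TIER.mkdir(parents=True, exist_ok=True)
    SLOW_TIER.mkdir(parents=True, exist_ok=True)
    fast_path = FAST_TIER / f"ckpt_{step}.bin"
    fast_path.write_bytes(data)  # fast, synchronous write on the burst buffer

    def drain() -> None:
        # slow, background copy to the parallel file system
        shutil.copy2(fast_path, SLOW_TIER / fast_path.name)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t  # join() before exit to make sure the copy has finished


if __name__ == "__main__":
    pending = checkpoint(step=100, data=b"\x00" * 1024)
    # ... training continues while the checkpoint drains in the background ...
    pending.join()
```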
Benefit of DDN's IME-FUSE for I/O intensive HPC applications
Many scientific applications are limited by the I/O performance offered by parallel file systems on conventional storage systems. Flash-based burst buffers provide significantly better performance than HDD-backed storage, but at the expense of capacity. Burst buffers are considered the next step towards achieving wire speed of the interconnect and providing more predictable, low-latency I/O, which are the holy grail of storage. A critical evaluation of storage technology is mandatory, as there is no long-term experience with its performance behavior for particular application scenarios. Such an evaluation enables data centers to choose the right products and system architects to plan the integration into HPC architectures. This paper investigates the native performance of DDN IME, a flash-based burst buffer solution. Then, it takes a closer look at the IME-FUSE file system, which uses IME as a burst buffer and a Lustre file system as the back-end. Finally, by utilizing a NetCDF benchmark, it estimates the performance benefit for climate applications.
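As a rough illustration of the kind of NetCDF benchmark mentioned above, the sketch below (an assumption on my part, not the paper's benchmark) times a bulk write of a 3-D variable with the netCDF4 Python bindings; the file path, variable shape, and step count are placeholders, and the same script can be pointed at a burst-buffer mount or at the parallel file system for comparison.

```python
# Toy NetCDF write benchmark (illustrative; not the paper's benchmark).
# Writes a 3-D float variable and reports the achieved bandwidth.
import time
import numpy as np
from netCDF4 import Dataset  # pip install netCDF4

PATH = "bench.nc"            # place on IME-FUSE or Lustre to compare
STEPS, LAT, LON = 64, 720, 1440

with Dataset(PATH, "w") as ds:
    ds.createDimension("time", None)
    ds.createDimension("lat", LAT)
    ds.createDimension("lon", LON)
    var = ds.createVariable("temperature", "f4", ("time", "lat", "lon"))

    field = np.random.rand(LAT, LON).astype("f4")
    start = time.perf_counter()
    for t in range(STEPS):
        var[t, :, :] = field        # one time step per write
    ds.sync()
    elapsed = time.perf_counter() - start

written_mib = STEPS * LAT * LON * 4 / 2**20
print(f"wrote {written_mib:.1f} MiB in {elapsed:.2f} s "
      f"({written_mib / elapsed:.1f} MiB/s)")
```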
GekkoFS: A temporary burst buffer file system for HPC applications
Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data, and the storage systems in today's HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, or randomized file I/O, while general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters and offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly scalable file system which has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics, supporting only the features that most (though not all) applications actually require. GekkoFS is, therefore, able to provide scalable I/O performance and reaches millions of metadata operations even on a small number of nodes, significantly outperforming the capabilities of common parallel file systems.
Preparing HPC Applications for the Exascale Era: A Decoupling Strategy
Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In the conventional construction of parallel applications, each process performs all the operations, which can be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange, and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.
Comment: The 46th International Conference on Parallel Processing (ICPP-2017)
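A minimal sketch of the decoupling idea, using mpi4py, is shown below. The group sizes, message tags, and the trivial "compute" and "I/O" roles are assumptions rather than the paper's implementation: the world communicator is split into a compute group and an I/O group, and compute ranks stream their results to dedicated I/O ranks so the two kinds of operations proceed as a pipeline.

```python
# Minimal sketch of operation decoupling with MPI (mpi4py). A few ranks are
# dedicated to I/O; the rest compute and stream results to them, so compute
# and I/O overlap in a simple dataflow pipeline. Roles, sizes, and the dummy
# workload are illustrative assumptions.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

NUM_IO_RANKS = max(1, size // 8)                 # e.g. 1 I/O rank per 8 processes
is_io = rank < NUM_IO_RANKS                      # first ranks act as the I/O group
group = comm.Split(color=int(is_io), key=rank)   # per-group communicator for collectives

STEPS = 4

if is_io:
    # I/O group: receive results from compute ranks and "write" them.
    n_compute = size - NUM_IO_RANKS
    my_producers = n_compute // NUM_IO_RANKS + (1 if rank < n_compute % NUM_IO_RANKS else 0)
    for _ in range(my_producers * STEPS):
        data = comm.recv(source=MPI.ANY_SOURCE, tag=7)
        # ... write `data` to storage here ...
else:
    # Compute group: do local work, then hand each result to a fixed I/O rank.
    target_io = (rank - NUM_IO_RANKS) % NUM_IO_RANKS
    for step in range(STEPS):
        result = np.full(1024, step, dtype=np.float64)   # dummy computation
        comm.send(result, dest=target_io, tag=7)

comm.Barrier()
if rank == 0:
    print("pipeline finished")
```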
Circuit-switch architecture for a 30/20-GHz FDMA/TDM geostationary satellite communications network
A circuit-switching architecture is described for a 30/20 GHz frequency-division multiple-access uplink / time-division-multiplexed downlink (FDMA/TDM) geostationary satellite communications network. Critical subsystems and problem areas are identified and addressed. Work was concentrated primarily on the space segment; however, the ground segment was considered concurrently to ensure cost efficiency and respect realistic operational constraints.