An Empirical Implementation of an I/O Separation Scheme for Burst Buffers in High-Performance Computing Systems
Master's thesis, Seoul National University Graduate School, Department of Computer Science and Engineering, College of Engineering, August 2019. Advisor: Hyeonsang Eom. To meet the exascale I/O requirements of High-Performance Computing (HPC), a new I/O subsystem named Burst Buffer, based on non-volatile memory, has been developed. However, the diverse HPC workloads and the bursty I/O pattern cause severe data fragmentation on SSDs, which creates the need for expensive garbage collection (GC) and also increases the number of bytes actually written to the SSD. The new multi-stream feature in SSDs offers an option to reduce the cost of garbage collection. In this paper, we leverage this multi-stream feature to group I/O streams based on user IDs and implement this strategy in a burst buffer we call BIOS, short for Burst Buffer with an I/O Separation scheme. Furthermore, to optimize the I/O separation scheme in burst buffer environments, we propose a stream-aware scheduling policy based on burst buffer pools in the workload manager and implement a real burst buffer system, the BIOS framework, by integrating BIOS with the workload manager. We evaluate BIOS and the framework with burst buffer I/O traces from the Cori supercomputer covering a diverse set of applications. We also disclose and analyze the benefits and limitations of using an I/O separation scheme in HPC systems. Experimental results show that BIOS improves performance by 1.44× on average and reduces the Write Amplification Factor (WAF) by up to 1.20×, and that the framework preserves the benefits of the I/O separation scheme in the HPC environment.
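The user-ID-based stream allocation described above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the BIOS implementation: it assumes a fixed number of SSD streams (the constant and names are placeholders) and simply maps each user's writes to a stream derived from the order in which users appear, sharing streams when there are more active users than streams.

```python
# Minimal sketch of user-ID-based I/O stream allocation (illustrative only;
# NUM_STREAMS and all names are assumptions, not the BIOS implementation).

NUM_STREAMS = 8  # multi-streamed SSDs expose a small, fixed number of streams


def allocate_stream(user_id: int, stream_table: dict) -> int:
    """Return the SSD stream assigned to this user, creating a mapping if needed.

    Writes from the same user are grouped into the same stream so that data
    with similar lifetimes is placed together, reducing garbage-collection cost.
    When there are more users than streams, users share streams round-robin.
    """
    if user_id not in stream_table:
        stream_table[user_id] = len(stream_table) % NUM_STREAMS
    return stream_table[user_id]


if __name__ == "__main__":
    table = {}
    for uid in [1001, 1002, 1001, 1003, 1010]:
        print(f"user {uid} -> stream {allocate_stream(uid, table)}")
```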
Abstract
Introduction
Background and Challenges
Burst Buffer
Write Amplification in SSDs
Multi-streamed SSD
Challenges of Multi-stream Feature in Burst Buffers
I/O Separation Scheme in Burst Buffer
Stream Allocation Criteria
Implementation
Limitations of User ID-based Stream Allocation
BIOS Framework
Support in Workload Manager
Burst Buffer Pools
Stream-Aware Scheduling Policy
Workflow of BIOS Framework
Evaluation
Experiment Setup
Evaluation with Synthetic Workload
Evaluation with HPC Applications
Evaluation with Emulated Workload
Evaluation with Different Striping Configuration
Evaluation on BIOS Framework
Summary and Lessons Learned
An I/O Separation Scheme in Burst Buffer
Evaluation with Synthetic Workload
Evaluation with HPC Applications
Evaluation with Emulated Workload
Evaluation with Striping Configurations
A BIOS Framework
Evaluation with Real Burst Buffer Environments
Discussion
Limited Number of Nodes
Advanced BIOS Framework
Related work
Conclusions
Bibliography
Abstract (in Korean)
Characterizing Deep-Learning I/O Workloads in TensorFlow
The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during training, a considerably large number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerators for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. Using a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to the slower storage on our benchmark environment.
Comment: Accepted for publication at PDSW-DISCS 201
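The burst-buffer checkpointing idea described above can be sketched in a few lines: write the checkpoint synchronously to a fast, small tier, then copy it to the slower, large tier in the background so training is not blocked. The snippet below is an illustrative sketch under assumed mount paths and function names, not the paper's implementation.

```python
# Minimal sketch of burst-buffer-style checkpointing: write the checkpoint to a
# fast, small-capacity tier, then copy it to slower, large-capacity storage in
# the background so training is not blocked. Paths and names are illustrative
# assumptions, not the paper's implementation.
import shutil
import threading
from pathlib import Path

FAST_TIER = Path("/mnt/burst_buffer/checkpoints")   # e.g. node-local NVMe
SLOW_TIER = Path("/mnt/parallel_fs/checkpoints")    # e.g. Lustre/GPFS


def checkpoint(step: int, data: bytes) -> threading.Thread:
    """Write the checkpoint to the fast tier, then drain it asynchronously."""
    FAST_TIER.mkdir(parents=True, exist_ok=True)
    SLOW_TIER.mkdir(parents=True, exist_ok=True)
    fast_path = FAST_TIER / f"ckpt_{step}.bin"
    fast_path.write_bytes(data)  # fast, synchronous write on the burst buffer

    def drain() -> None:
        # slow, background copy to the parallel file system
        shutil.copy2(fast_path, SLOW_TIER / fast_path.name)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t  # join() before exit to make sure the copy has finished


if __name__ == "__main__":
    pending = checkpoint(step=100, data=b"\x00" * 1024)
    # ... training continues while the checkpoint drains in the background ...
    pending.join()
```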
Benefit of DDN's IME-FUSE for I/O intensive HPC applications
Many scientific applications are limited by the I/O performance offered by parallel file systems on conventional storage systems. Flash-based burst buffers provide significantly better performance than HDD-backed storage, but at the expense of capacity. Burst buffers are considered the next step towards achieving wire speed of the interconnect and providing more predictable, low-latency I/O, which are the holy grail of storage. A critical evaluation of storage technology is mandatory, as there is no long-term experience with its performance behavior for particular application scenarios. Such an evaluation enables data centers to choose the right products and system architects to plan the integration into HPC architectures. This paper investigates the native performance of DDN IME, a flash-based burst buffer solution. Then, it takes a closer look at the IME-FUSE file system, which uses IME as a burst buffer and a Lustre file system as the back-end. Finally, by utilizing a NetCDF benchmark, it estimates the performance benefit for climate applications.
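As a rough illustration of the kind of NetCDF benchmark mentioned above, the sketch below (an assumption on my part, not the paper's benchmark) times a bulk write of a 3-D variable with the netCDF4 Python bindings; the file path, variable shape, and step count are placeholders, and the same script can be pointed at a burst-buffer mount or at the parallel file system for comparison.

```python
# Toy NetCDF write benchmark (illustrative; not the paper's benchmark).
# Writes a 3-D float variable and reports the achieved bandwidth.
import time
import numpy as np
from netCDF4 import Dataset  # pip install netCDF4

PATH = "bench.nc"            # place on IME-FUSE or Lustre to compare
STEPS, LAT, LON = 64, 720, 1440

with Dataset(PATH, "w") as ds:
    ds.createDimension("time", None)
    ds.createDimension("lat", LAT)
    ds.createDimension("lon", LON)
    var = ds.createVariable("temperature", "f4", ("time", "lat", "lon"))

    field = np.random.rand(LAT, LON).astype("f4")
    start = time.perf_counter()
    for t in range(STEPS):
        var[t, :, :] = field        # one time step per write
    ds.sync()
    elapsed = time.perf_counter() - start

written_mib = STEPS * LAT * LON * 4 / 2**20
print(f"wrote {written_mib:.1f} MiB in {elapsed:.2f} s "
      f"({written_mib / elapsed:.1f} MiB/s)")
```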
GekkoFS: A temporary burst buffer file system for HPC applications
Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data, and the storage systems in today's HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, or randomized file I/O, while general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters and offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly scalable file system which has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics, supporting only the features that most (though not all) applications actually require. GekkoFS is, therefore, able to provide scalable I/O performance and reaches millions of metadata operations even on a small number of nodes, significantly outperforming the capabilities of common parallel file systems.
Preparing HPC Applications for the Exascale Era: A Decoupling Strategy
Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In the conventional construction of parallel applications, each process performs all the operations, which can be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange, and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.
Comment: The 46th International Conference on Parallel Processing (ICPP-2017)
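A minimal sketch of the decoupling idea, using mpi4py, is shown below. The group sizes, message tags, and the trivial "compute" and "I/O" roles are assumptions rather than the paper's implementation: the world communicator is split into a compute group and an I/O group, and compute ranks stream their results to dedicated I/O ranks so the two kinds of operations proceed as a pipeline.

```python
# Minimal sketch of operation decoupling with MPI (mpi4py). A few ranks are
# dedicated to I/O; the rest compute and stream results to them, so compute
# and I/O overlap in a simple dataflow pipeline. Roles, sizes, and the dummy
# workload are illustrative assumptions.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

NUM_IO_RANKS = max(1, size // 8)                 # e.g. 1 I/O rank per 8 processes
is_io = rank < NUM_IO_RANKS                      # first ranks act as the I/O group
group = comm.Split(color=int(is_io), key=rank)   # per-group communicator for collectives

STEPS = 4

if is_io:
    # I/O group: receive results from compute ranks and "write" them.
    n_compute = size - NUM_IO_RANKS
    my_producers = n_compute // NUM_IO_RANKS + (1 if rank < n_compute % NUM_IO_RANKS else 0)
    for _ in range(my_producers * STEPS):
        data = comm.recv(source=MPI.ANY_SOURCE, tag=7)
        # ... write `data` to storage here ...
else:
    # Compute group: do local work, then hand each result to a fixed I/O rank.
    target_io = (rank - NUM_IO_RANKS) % NUM_IO_RANKS
    for step in range(STEPS):
        result = np.full(1024, step, dtype=np.float64)   # dummy computation
        comm.send(result, dest=target_io, tag=7)

comm.Barrier()
if rank == 0:
    print("pipeline finished")
```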
Circuit-switch architecture for a 30/20-GHz FDMA/TDM geostationary satellite communications network
A circuit-switching architecture is described for a 30/20 GHz frequency-division multiple-access uplink / time-division-multiplexed downlink (FDMA/TDM) geostationary satellite communications network. Critical subsystems and problem areas are identified and addressed. Work was concentrated primarily on the space segment; however, the ground segment was considered concurrently to ensure cost efficiency and respect realistic operational constraints.