77 research outputs found

    Cost-Efficient Machine Learning Training on Preemptible Cloud Clusters

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2022. 8. ์ „๋ณ‘๊ณค.Due to the high cost of building a physical GPU cluster infrastructure for AI model training, the demand for training on โ€œpay-as-you-goโ€ public cloud clusters has increased rapidly. In particular, training AI models using preemptible(i.e., spot) VMs provided at steep price discounts has attracted the attention of many researchers. However, since cloud providers can unilaterally revoke preemptible VMs at any time, it may result in the loss of underway training states. Due to the trade-off between cost and reliability, researchers are disinclined to actively adopt preemptible VMs for their experiments. In this paper, we discuss the major challenges of AI model training on preemptible VMs and propose Spotify, an AI model training job orchestrator, which automatically deals with the challenges and enables reliable training on preemptible cloud clusters. Researchers can run training jobs on low-price preemptible clusters under the illusion of using reliable on-demand clusters. Our evaluations show that Spotify reduces the 62% of end-to-end training cost with only sacrificing 2.86% additional latency overhead compared to the training on on-demand clusters.์ธ๊ณต์ง€๋Šฅ ๋ชจ๋ธ ํ•™์Šต์„ ์œ„ํ•ด ๋ฌผ๋ฆฌ์ ์œผ๋กœ GPU ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๊ตฌ์ถ• ๋ฐ ๊ด€๋ฆฌํ•˜๋Š” ๋ฐ์— ๋Š” ๋งŽ์€ ๋น„์šฉ์ด ํˆฌ์ž๋˜์–ด์•ผ ํ•œ๋‹ค. ์ด์— ๋”ฐ๋ผ ์ธ๊ณต์ง€๋Šฅ ๋ชจ๋ธ ๊ฐœ๋ฐœ์ž๋“ค ์‚ฌ์ด์—์„œ๋Š” ์‚ฌ์šฉํ•œ ๋งŒํผ์˜ ๋น„์šฉ๋งŒ์„ ์ง€๋ถˆํ•˜์—ฌ ์‚ฌ์šฉ์ด ๊ฐ€๋Šฅํ•œ ํด๋ผ์šฐ๋“œ ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ ํ•™์Šต์„ ํ•˜๋ ค๋Š” ์ˆ˜์š”๊ฐ€ ์ ์ฐจ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. ํŠนํžˆ ํฐ ํญ์˜ ํ• ์ธ๋œ ๊ฐ€๊ฒฉ์œผ๋กœ ์ œ๊ณต๋˜๋Š” ์„ ์ ๊ฐ€๋Šฅํ˜• ๊ฐ€์ƒ๋จธ์‹ ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ ํ•™์Šต์„ ํ•˜๋Š” ๋ฐฉ์‹์ด ํฐ ์ฃผ๋ชฉ์„ ๋ฐ›๊ณ  ์žˆ๋‹ค. 
ํ•˜์ง€๋งŒ ์„ ์ ๊ฐ€๋Šฅํ˜• ๊ฐ€์ƒ๋จธ์‹ ์€ ํด๋ผ์šฐ๋“œ ์ œ๊ณต์‚ฌ์— ์˜ํ•ด ์–ธ์ œ๋“ ์ง€ ์ผ๋ฐฉ์ ์œผ๋กœ ์„ ์ ์„ ๋‹นํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ง„ํ–‰ ์ค‘์ด๋˜ ํ•™์Šต ์ƒํƒœ์˜ ์†์‹ค์ด ์•ผ๊ธฐ๋  ์ˆ˜ ์žˆ๋‹ค. ๋น„์šฉ๊ณผ ์•ˆ์ „์„ฑ ๋ฉด์—์„œ ๊ตํ™˜์ด ๋ฐœ์ƒํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ฐœ๋ฐœ์ž๋“ค์€ ์„ ์ ๊ฐ€๋Šฅํ˜• ๊ฐ€์ƒ๋จธ์‹ ์„ ๋ชจ๋ธ ํ•™์Šต ๋ฐ ์‹คํ—˜์— ์ ๊ทน์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์„ ๊ฒช๊ณ  ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์„ ์ ๊ฐ€๋Šฅํ˜• ๊ฐ€์ƒ๋จธ์‹ ์—์„œ ์ธ๊ณต์ง€๋Šฅ ๋ชจ๋ธ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋Š” ๋ฐ ์žˆ์–ด ์กด์žฌํ•˜๋Š” ์ฃผ์š”ํ•œ ์–ด๋ ค์›€๋“ค์— ๋Œ€ํ•ด ๋…ผ์˜ํ•˜๊ณ , ์ž๋™ํ™”๋œ ๋ฐฉ์‹์„ ํ†ตํ•ด ๊ทธ๋Ÿฌํ•œ ์–ด๋ ค ์›€์„ ํ•ด๊ฒฐํ•จ์œผ๋กœ์จ ์„ ์ ๊ฐ€๋Šฅํ˜• ํด๋ผ์šฐ๋“œ ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ์•ˆ์ •์ ์ธ ํ•™์Šต์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ์ธ๊ณต์ง€๋Šฅ ๋ชจ๋ธ ํ•™์Šต ์ž‘์—… ๊ด€๋ฆฌ ์‹œ์Šคํ…œ์ธ Spotify๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์šฐ๋ฆฌ์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” Spotify๊ฐ€ ์„ ์ ๊ฐ€๋Šฅํ˜• ํด๋ผ์šฐ๋“œ ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ์˜จ๋””๋งจ๋“œ ํด๋ผ์šฐ๋“œ ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ ๋Œ€๋น„ 2.86%์˜ ์ง€์—ฐ์‹œ๊ฐ„ ์˜ค๋ฒ„ํ—ค๋“œ๋งŒ์„ ํฌ์ƒํ•˜์—ฌ ์ตœ๋Œ€ 62%์— ๋‹ฌํ•˜๋Š” ๋น„์šฉ์„ ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค.Abstract 1 1 Introduction 5 2 Background 8 2.1 Preemptible Virtual Machines 8 2.2 Model Training and Checkpointing 9 3 Challenges 12 3.1 Unpredictability of Preemptions 12 3.2 Resource Management 14 4 Modeling Checkpointing Policy 15 4.1 Approximating Optimal Checkpointing Interval 15 4.2 Emergency Save 17 4.3 Insurance Save 18 4.4 Adaptive Checkpointing 19 5 System Design 22 5.1 System Architecture and Workflow 22 5.2 API Design 25 6 Evaluation 27 6.1 Environment 27 6.1.1 Cloud VM 27 6.1.2 Job Specification 28 6.2 Evaluation Tools 28 6.2.1 Preemption Injector 28 6.2.2 Training Simulator 29 6.3 Training Performance and Cost 30 6.3.1 Efficiency of EmergencySave 30 6.3.2 Efficiency of Insurance Save 32 6.4 Effect of Preemption Frequency 35 7 Conclusion 36 ์ดˆ๋ก 41์„
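The "Approximating Optimal Checkpointing Interval" chapter builds on the classical first-order analysis of checkpoint cost versus failure rate; a minimal sketch of Young's well-known approximation (the function name and parameters are illustrative, not Spotify's actual API):

```python
import math

def young_daly_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order approximation of the checkpoint interval that
    minimizes expected wasted time: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Example: a 30 s checkpoint on spot VMs preempted every ~2 h on average.
interval = young_daly_interval(checkpoint_cost=30.0, mtbf=7200.0)
print(f"checkpoint every ~{interval:.0f} s")  # → checkpoint every ~657 s
```

The chapter titles suggest Spotify layers emergency and insurance saves on top of such a periodic baseline rather than relying on the fixed interval alone.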

    Extending Scojo-PECT by migration based on application level checkpointing

    Get PDF
    In parallel computing, jobs have different runtimes and resource requirements. Because runtimes correlate with resources, scheduling these jobs is a packing problem in which utilization and total execution time vary. Resources can sit idle while jobs are preempted or blocked by resource conflicts, with no chance to make use of them, which wastes system resources to a considerable degree. Here we propose an approach that takes periodic checkpoints of running jobs and exploits the resulting opportunity for migration to optimize our scheduler during long-term scheduling. We improve our original Scojo-PECT preemptive scheduler, which previously had no checkpoint support. We evaluate the execution time gained minus the overhead of checkpointing/migration, and compare the result with the original execution time.
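The gained-time-minus-overhead comparison the abstract describes can be sketched as a simple decision rule (all names are hypothetical, not Scojo-PECT's actual interface):

```python
def migration_net_gain(remaining_on_current: float,
                       remaining_on_target: float,
                       checkpoint_cost: float,
                       migration_cost: float) -> float:
    """Net execution time gained (seconds) by checkpointing a job and
    migrating it to a less contended node; positive means it pays off."""
    gain = remaining_on_current - remaining_on_target
    return gain - (checkpoint_cost + migration_cost)

def should_migrate(*args: float) -> bool:
    """Migrate only when the saved time exceeds the combined overhead."""
    return migration_net_gain(*args) > 0.0

# Example: moving saves 600 s, at 40 s checkpoint + 90 s migration cost.
print(should_migrate(1800.0, 1200.0, 40.0, 90.0))  # → True (net gain 470 s)
```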

    Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing

    Full text link
    Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results depends on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of the workloads on which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead. In this paper we present Khaos, a new approach that utilizes the parallel processing capabilities of virtual cloud automation technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three successive phases that borrow from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented Khaos prototypically together with Apache Flink and demonstrate its usefulness experimentally.
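The three-phase loop described above can be sketched abstractly (a toy control loop with invented names; Khaos's real implementation drives Apache Flink and cloud automation, not this simplification):

```python
def tune_checkpoint_interval(measure_violations, interval: float,
                             target: float = 0.01,
                             step: float = 1.25) -> float:
    """One iteration of a Khaos-style loop (illustrative only): shorten
    the checkpoint interval when failure experiments reveal too many QoS
    violations, lengthen it when there is headroom."""
    violations = measure_violations(interval)  # phase 2: inject failures, observe
    if violations > target:                    # phase 3: adapt the configuration
        return interval / step                 # checkpoint more often
    return interval * step                     # relax to reduce overhead

# Phase 1 stand-in: a steady-state model where violations grow with interval.
model = lambda t: t / 1000.0
interval = 30.0
for _ in range(5):
    interval = tune_checkpoint_interval(model, interval)
print(round(interval, 4))  # → 9.8304 (interval tightened until QoS is met)
```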

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Get PDF
    In order to efficiently use future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency, but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be used directly out of the box. The library can be easily extended to add more data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of the failure detection and communication recovery mechanisms. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.
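CRAFT itself is a C++ library with its own checkpoint data types; purely to illustrate what application-level checkpoint/restart means, here is a language-agnostic sketch in Python (atomic write plus restart from the last saved state; none of these names come from CRAFT's API):

```python
import json
import os
import tempfile

# Illustrative checkpoint location, not a CRAFT convention.
CKPT = os.path.join(tempfile.gettempdir(), "demo_state.json")

def save_checkpoint(state: dict, path: str = CKPT) -> None:
    """Write the checkpoint atomically so that a crash mid-write can
    never replace a valid older checkpoint with a truncated one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

def load_checkpoint(path: str = CKPT) -> dict:
    """Resume from the last checkpoint, or cold-start if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "result": 0.0}

state = load_checkpoint()
for i in range(state["iteration"], 10):
    state["result"] += i              # the application's real work
    state["iteration"] = i + 1
    if state["iteration"] % 5 == 0:   # checkpoint every 5 iterations
        save_checkpoint(state)
```

Restarting the script after a kill resumes from the last multiple-of-five iteration instead of from zero, which is the whole point of the technique.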

    Near-optimal scheduling and decision-making models for reactive and proactive fault tolerance mechanisms

    Get PDF
    As High Performance Computing (HPC) systems increase in size to fulfill computational power demand, the chance of failure occurrences dramatically increases, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failure occurrences on the running applications. However, the overhead of FT mechanisms increases proportionally to the HPC systems' size. Therefore, challenges arise in handling the expensive overhead of FT mechanisms while minimizing the large amount of lost computing time due to failure occurrences. In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint mechanism, by means of stochastic processes and calculus of variations. The obtained schedule minimizes the waste time caused by the checkpoint mechanism and failure occurrences. Generally, checkpoint/restart mechanisms periodically save application states and load the saved state upon failure occurrences. Furthermore, to handle various FT mechanisms, an adaptive decision-making model has been developed to determine the best FT strategy to invoke at each decision point. The best mechanism at each decision point is selected among the considered FT mechanisms to globally minimize the total waste time for an application execution by means of a dynamic programming approach. In addition, the model is adaptive to deal with changes in failure rate over time.
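Choosing the best mechanism per decision point to globally minimize total waste is a classic dynamic program; a minimal sketch under assumed per-interval waste values and a mechanism-switching cost (both invented for illustration, not the dissertation's model):

```python
def min_total_waste(num_points: int, waste: dict, switch_cost: float) -> float:
    """waste[m][t] is the expected waste if mechanism m runs in interval t;
    switching mechanisms between intervals adds switch_cost. The DP state
    is (interval, mechanism chosen for that interval)."""
    mechs = list(waste)
    prev = {m: float(waste[m][0]) for m in mechs}
    for t in range(1, num_points):
        cur = {}
        for m in mechs:
            cur[m] = waste[m][t] + min(
                prev[k] + (0.0 if k == m else switch_cost) for k in mechs)
        prev = cur
    return min(prev.values())

# Hypothetical waste profiles for two FT mechanisms over 3 decision points.
waste = {"checkpoint": [5, 5, 5], "migration": [1, 9, 1]}
print(min_total_waste(3, waste, switch_cost=2.0))  # → 11.0
```

Here the optimum (11.0) beats both all-checkpoint (15) and any switching plan once the switch cost is charged, which is exactly the kind of trade-off a greedy per-interval choice would miss.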

    Checkpointing algorithms and fault prediction

    Get PDF
    This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly to the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointing period. These results make it possible to analytically assess the key parameters that impact the performance of fault predictors at very large scale.
    Comment: Supported in part by ANR Rescue. Published in the Journal of Parallel and Distributed Computing.
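As a back-of-the-envelope illustration only (not the paper's actual derivation): a predictor with recall r leaves a fraction 1 − r of failures unpredicted, so computing the Young/Daly period against that reduced failure rate stretches it by 1/√(1 − r):

```python
import math

def young_daly(mu: float, C: float) -> float:
    """Classical first-order optimal period: platform MTBF mu,
    checkpoint cost C, both in seconds."""
    return math.sqrt(2.0 * mu * C)

def period_with_predictor(mu: float, C: float, recall: float) -> float:
    """Illustrative simplification, not the paper's exact result: if only
    the (1 - recall) unpredicted failures drive periodic checkpointing,
    the effective MTBF grows to mu / (1 - recall)."""
    return young_daly(mu / (1.0 - recall), C)

print(round(young_daly(86400, 60)))                   # → 3220 (seconds)
print(round(period_with_predictor(86400, 60, 0.75)))  # → 6440 (seconds)
```

With recall 0.75 the period doubles in this toy model; the paper's contribution is the rigorous version of this analysis, including when to trust an individual prediction given the predictor's precision.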