77 research outputs found
Cost-Efficient Machine Learning Training on Preemptible Cloud Clusters
Thesis (M.S.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2022. Advisor: Byung-Gon Chun.
Due to the high cost of building a physical GPU cluster infrastructure for AI model training, the demand for training on "pay-as-you-go" public cloud clusters has increased rapidly. In particular, training AI models on preemptible (i.e., spot) VMs, which are offered at steep discounts, has attracted the attention of many researchers. However, since cloud providers can unilaterally revoke preemptible VMs at any time, in-progress training state may be lost. Because of this trade-off between cost and reliability, researchers are reluctant to adopt preemptible VMs for their experiments. In this paper, we discuss the major challenges of AI model training on preemptible VMs and propose Spotify, an AI model training job orchestrator that automatically handles these challenges and enables reliable training on preemptible cloud clusters. Researchers can run training jobs on low-price preemptible clusters under the illusion of using reliable on-demand clusters. Our evaluations show that Spotify reduces end-to-end training cost by up to 62% while adding only 2.86% latency overhead compared to training on on-demand clusters.
Abstract
1 Introduction
2 Background
2.1 Preemptible Virtual Machines
2.2 Model Training and Checkpointing
3 Challenges
3.1 Unpredictability of Preemptions
3.2 Resource Management
4 Modeling Checkpointing Policy
4.1 Approximating Optimal Checkpointing Interval
4.2 Emergency Save
4.3 Insurance Save
4.4 Adaptive Checkpointing
5 System Design
5.1 System Architecture and Workflow
5.2 API Design
6 Evaluation
6.1 Environment
6.1.1 Cloud VM
6.1.2 Job Specification
6.2 Evaluation Tools
6.2.1 Preemption Injector
6.2.2 Training Simulator
6.3 Training Performance and Cost
6.3.1 Efficiency of Emergency Save
6.3.2 Efficiency of Insurance Save
6.4 Effect of Preemption Frequency
7 Conclusion
Abstract (in Korean)
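A back-of-the-envelope sketch of the trade-off the thesis quantifies: spot VMs cut the hourly price steeply, so even after paying the reported 2.86% latency overhead for preemption handling, total cost drops sharply. The VM prices below are hypothetical, not from the thesis:

```python
ON_DEMAND_RATE = 3.06      # $/h, hypothetical on-demand GPU VM price
SPOT_RATE      = 1.10      # $/h, hypothetical discounted spot price
BASE_HOURS     = 100.0     # training time on reliable on-demand VMs
OVERHEAD       = 0.0286    # 2.86% extra latency reported by the thesis

on_demand_cost = ON_DEMAND_RATE * BASE_HOURS
spot_cost      = SPOT_RATE * BASE_HOURS * (1.0 + OVERHEAD)
saving = 1.0 - spot_cost / on_demand_cost
print(f"{saving:.0%}")     # ~63% cheaper despite the latency overhead
```

With these assumed prices the discount dominates the overhead, which is the effect Spotify's 62% cost-reduction result reflects.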
Extending Scojo-PECT by migration based on application level checkpointing
In parallel computing, jobs have different runtimes and resource requirements. Since runtime is correlated with the resources assigned, scheduling these jobs is a packing problem in which utilization and total execution time vary with the packing. Resources sometimes sit idle while jobs are preempted or blocked by resource conflicts with no chance to use them, which wastes a significant amount of system capacity.
Here we propose an approach that takes periodic checkpoints of running jobs and exploits the resulting opportunity for migration to optimize the scheduler over long-term scheduling. We extend our original Scojo-PECT preemptive scheduler, which previously had no checkpoint support. We evaluate the execution time gained minus the overhead of checkpointing/migration, and compare it with the original execution time.
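The evaluation criterion described above, time gained minus checkpoint/migration overhead, can be sketched as a trivial decision helper (all numbers are illustrative, not from the paper):

```python
def net_migration_gain(gained_time_s, ckpt_overhead_s, migration_overhead_s):
    """Net benefit of checkpoint-based migration: time reclaimed from
    otherwise idle resources minus the cost of taking the checkpoint and
    moving the job. Migrating is worthwhile only when the result is positive."""
    return gained_time_s - (ckpt_overhead_s + migration_overhead_s)

# Example: a migration that reclaims 600 s of idle time at a cost of
# 45 s of checkpointing and 120 s of state transfer.
print(net_migration_gain(600, 45, 120))   # 435 -> worth migrating
print(net_migration_gain(100, 45, 120))   # -65 -> stay put
```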
Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing
Distributed Stream Processing systems are becoming an increasingly essential
part of Big Data processing platforms as users grow ever more reliant on their
ability to provide fast access to new results. As such, making timely decisions
based on these results is dependent on a system's ability to tolerate failure.
Typically, these systems achieve fault tolerance and the ability to recover
automatically from partial failures by implementing checkpoint and rollback
recovery. However, owing to the statistical probability of partial failures
occurring in these distributed environments and the variability of workloads
upon which jobs are expected to operate, static configurations will often not
meet Quality of Service constraints with low overhead.
In this paper we present Khaos, a new approach which utilizes the parallel
processing capabilities of virtual cloud automation technologies for the
automatic runtime optimization of fault tolerance configurations in Distributed
Stream Processing jobs. Our approach employs three successive phases that
borrow from the principles of Chaos Engineering: establish the steady-state
processing conditions, conduct experiments to better understand how the system
performs under failure, and use this knowledge to continuously minimize Quality
of Service violations. We implemented Khaos prototypically together with Apache
Flink and demonstrate its usefulness experimentally.
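The three-phase loop can be sketched as follows; all function names and numbers are illustrative, not Khaos's actual API:

```python
def steady_state(latencies_s):
    """Phase 1: baseline end-to-end latency under normal operation."""
    return sum(latencies_s) / len(latencies_s)

def failure_experiment(baseline_s, recovery_time_s, qos_limit_s):
    """Phase 2: inject a failure, observe recovery, and report whether
    the Quality of Service constraint was violated."""
    return baseline_s + recovery_time_s > qos_limit_s

def tune_interval(interval_s, violated):
    """Phase 3: checkpoint more often after a violation, relax otherwise."""
    return interval_s * 0.5 if violated else interval_s * 1.1

baseline = steady_state([0.8, 1.0, 1.2])   # seconds
violated = failure_experiment(baseline, recovery_time_s=9.5, qos_limit_s=10.0)
print(tune_interval(60.0, violated))       # violation -> shorten to 30.0
```

A runtime optimizer would repeat these phases continuously as the workload changes, which is the "automatic runtime optimization" the abstract describes.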
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
In order to efficiently use the future generations of supercomputers, fault
tolerance and power consumption are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has
been and still is the most widely used technique to deal with hard failures.
Application-level CR is the most effective CR technique in terms of overhead
efficiency but it takes a lot of implementation effort. This work presents the
implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic
Fault Tolerance), which serves two purposes. First, it provides an extendable
library that significantly eases the implementation of application-level
checkpointing. The most basic and frequently used checkpoint data types are
already part of CRAFT and can be directly used out of the box. The library can
be easily extended to add more data types. As a means of overhead reduction, the
library offers a built-in asynchronous checkpointing mechanism and also
supports the Scalable Checkpoint/Restart (SCR) library for node-level
checkpointing. Second, CRAFT provides an easier interface for User-Level
Failure Mitigation (ULFM) based dynamic process recovery, which significantly
reduces the complexity and effort of failure detection and communication
recovery mechanisms. By utilizing both functionalities together, applications
can write application-level checkpoints and recover dynamically from process
failures with very limited programming effort. This work presents the design
and use of our library in detail. The associated overheads are thoroughly
analyzed using several benchmarks.
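The application-level checkpoint/restart pattern that CRAFT eases can be illustrated generically. This Python sketch shows only the pattern (name the needed state, save it periodically, reload it on restart); it is not CRAFT's actual C++ API:

```python
import os
import pickle

CKPT = "state.ckpt"

def save(state, path=CKPT):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:       # write to a temp file first so a
        pickle.dump(state, f)        # crash mid-write cannot corrupt
    os.replace(tmp, path)            # the last good checkpoint

def restore(path=CKPT):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "total": 0.0}   # cold start

state = restore()
for i in range(state["iteration"], 10):
    state["total"] += i                  # the actual computation
    state["iteration"] = i + 1
    if state["iteration"] % 5 == 0:      # checkpoint every 5 iterations
        save(state)
print(state["total"])  # 45.0
```

A library like CRAFT removes the need to hand-write the save/restore plumbing for each data type, which is exactly the implementation effort the abstract highlights.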
Near-optimal scheduling and decision-making models for reactive and proactive fault tolerance mechanisms
As High Performance Computing (HPC) systems increase in size to fulfill computational power demand, the chance of failure occurrences dramatically increases, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failure occurrences to the running applications. However, the overhead of FT mechanisms increases proportionally to the HPC systems' size. Therefore, challenges arise in handling the expensive overhead of FT mechanisms while minimizing the large amount of lost computing time due to failure occurrences.
In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint mechanism, by means of stochastic processes and calculus of variations. The obtained schedule minimizes the time wasted by the checkpoint mechanism and by failure occurrences. Checkpoint/restart mechanisms generally save application state periodically and load the saved state upon a failure. Furthermore, to handle various FT mechanisms, an adaptive decision-making model has been developed to determine the best FT strategy to invoke at each decision point. The best mechanism at each decision point is selected among the considered FT mechanisms to globally minimize the total waste time for an application execution by means of a dynamic programming approach. In addition, the model adapts to changes in the failure rate over time.
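The decision-point idea can be illustrated with a toy model (all costs assumed). The dissertation's model is stochastic and state-dependent; in this simplified version the per-point costs are independent, so the dynamic program reduces to taking the cheapest mechanism at each point:

```python
# waste[mechanism][t] = assumed expected waste (s) if that mechanism is
# invoked at decision point t.
WASTE = {
    "full_checkpoint":   [40, 40, 40],
    "incremental_ckpt":  [15, 25, 55],
    "process_migration": [30, 20, 35],
}

def best_plan(waste):
    """Pick, at each decision point, the mechanism with the lowest expected
    waste; with independent per-point costs this per-point minimum equals
    the dynamic-programming optimum over the whole execution."""
    points = len(next(iter(waste.values())))
    plan, total = [], 0
    for t in range(points):
        mech = min(waste, key=lambda m: waste[m][t])
        plan.append(mech)
        total += waste[mech][t]
    return plan, total

plan, total = best_plan(WASTE)
print(plan, total)  # picks incremental first, then migration twice; total 70
```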
Checkpointing algorithms and fault prediction
This paper deals with the impact of fault prediction techniques on
checkpointing strategies. We extend the classical first-order analysis of Young
and Daly in the presence of a fault prediction system, characterized by its
recall and its precision. In this framework, we provide an optimal algorithm to
decide when to take predictions into account, and we derive the optimal value
of the checkpointing period. These results allow us to analytically assess the
key parameters that impact the performance of fault predictors at very large
scale. Published in the Journal of Parallel and Distributed Computing.
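The classical first-order analysis that this paper extends can be sketched as follows; the paper's refinement with predictor recall and precision is not reproduced here:

```python
import math

def waste_fraction(T, C, mu):
    """First-order waste with period T, checkpoint cost C, MTBF mu:
    C/T spent checkpointing plus T/(2*mu) of lost work per failure."""
    return C / T + T / (2.0 * mu)

def t_opt(C, mu):
    """Young/Daly period minimizing the waste: T_opt = sqrt(2*C*mu)."""
    return math.sqrt(2.0 * C * mu)

C, mu = 60.0, 24 * 3600.0          # 1-minute checkpoints, 24 h MTBF
T = t_opt(C, mu)
# T_opt is the minimum: waste rises on either side of it.
assert waste_fraction(T, C, mu) < waste_fraction(0.5 * T, C, mu)
assert waste_fraction(T, C, mu) < waste_fraction(2.0 * T, C, mu)
print(round(T))   # 3220 s, i.e. checkpoint roughly every 54 minutes
```

A prediction system with nonzero recall effectively lowers the rate of unanticipated failures, which is why the paper's optimal period depends on the predictor's recall and precision.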