77 research outputs found

    Cost-Efficient Machine Learning Training on Preemptible Cloud Clusters

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2022. 8. ์ „๋ณ‘๊ณค.Due to the high cost of building a physical GPU cluster infrastructure for AI model training, the demand for training on โ€œpay-as-you-goโ€ public cloud clusters has increased rapidly. In particular, training AI models using preemptible(i.e., spot) VMs provided at steep price discounts has attracted the attention of many researchers. However, since cloud providers can unilaterally revoke preemptible VMs at any time, it may result in the loss of underway training states. Due to the trade-off between cost and reliability, researchers are disinclined to actively adopt preemptible VMs for their experiments. In this paper, we discuss the major challenges of AI model training on preemptible VMs and propose Spotify, an AI model training job orchestrator, which automatically deals with the challenges and enables reliable training on preemptible cloud clusters. Researchers can run training jobs on low-price preemptible clusters under the illusion of using reliable on-demand clusters. Our evaluations show that Spotify reduces the 62% of end-to-end training cost with only sacrificing 2.86% additional latency overhead compared to the training on on-demand clusters.์ธ๊ณต์ง€๋Šฅ ๋ชจ๋ธ ํ•™์Šต์„ ์œ„ํ•ด ๋ฌผ๋ฆฌ์ ์œผ๋กœ GPU ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ๊ตฌ์ถ• ๋ฐ ๊ด€๋ฆฌํ•˜๋Š” ๋ฐ์— ๋Š” ๋งŽ์€ ๋น„์šฉ์ด ํˆฌ์ž๋˜์–ด์•ผ ํ•œ๋‹ค. ์ด์— ๋”ฐ๋ผ ์ธ๊ณต์ง€๋Šฅ ๋ชจ๋ธ ๊ฐœ๋ฐœ์ž๋“ค ์‚ฌ์ด์—์„œ๋Š” ์‚ฌ์šฉํ•œ ๋งŒํผ์˜ ๋น„์šฉ๋งŒ์„ ์ง€๋ถˆํ•˜์—ฌ ์‚ฌ์šฉ์ด ๊ฐ€๋Šฅํ•œ ํด๋ผ์šฐ๋“œ ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ ํ•™์Šต์„ ํ•˜๋ ค๋Š” ์ˆ˜์š”๊ฐ€ ์ ์ฐจ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. ํŠนํžˆ ํฐ ํญ์˜ ํ• ์ธ๋œ ๊ฐ€๊ฒฉ์œผ๋กœ ์ œ๊ณต๋˜๋Š” ์„ ์ ๊ฐ€๋Šฅํ˜• ๊ฐ€์ƒ๋จธ์‹ ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ ํ•™์Šต์„ ํ•˜๋Š” ๋ฐฉ์‹์ด ํฐ ์ฃผ๋ชฉ์„ ๋ฐ›๊ณ  ์žˆ๋‹ค. 
ํ•˜์ง€๋งŒ ์„ ์ ๊ฐ€๋Šฅํ˜• ๊ฐ€์ƒ๋จธ์‹ ์€ ํด๋ผ์šฐ๋“œ ์ œ๊ณต์‚ฌ์— ์˜ํ•ด ์–ธ์ œ๋“ ์ง€ ์ผ๋ฐฉ์ ์œผ๋กœ ์„ ์ ์„ ๋‹นํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ง„ํ–‰ ์ค‘์ด๋˜ ํ•™์Šต ์ƒํƒœ์˜ ์†์‹ค์ด ์•ผ๊ธฐ๋  ์ˆ˜ ์žˆ๋‹ค. ๋น„์šฉ๊ณผ ์•ˆ์ „์„ฑ ๋ฉด์—์„œ ๊ตํ™˜์ด ๋ฐœ์ƒํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ฐœ๋ฐœ์ž๋“ค์€ ์„ ์ ๊ฐ€๋Šฅํ˜• ๊ฐ€์ƒ๋จธ์‹ ์„ ๋ชจ๋ธ ํ•™์Šต ๋ฐ ์‹คํ—˜์— ์ ๊ทน์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์„ ๊ฒช๊ณ  ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์„ ์ ๊ฐ€๋Šฅํ˜• ๊ฐ€์ƒ๋จธ์‹ ์—์„œ ์ธ๊ณต์ง€๋Šฅ ๋ชจ๋ธ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋Š” ๋ฐ ์žˆ์–ด ์กด์žฌํ•˜๋Š” ์ฃผ์š”ํ•œ ์–ด๋ ค์›€๋“ค์— ๋Œ€ํ•ด ๋…ผ์˜ํ•˜๊ณ , ์ž๋™ํ™”๋œ ๋ฐฉ์‹์„ ํ†ตํ•ด ๊ทธ๋Ÿฌํ•œ ์–ด๋ ค ์›€์„ ํ•ด๊ฒฐํ•จ์œผ๋กœ์จ ์„ ์ ๊ฐ€๋Šฅํ˜• ํด๋ผ์šฐ๋“œ ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ์•ˆ์ •์ ์ธ ํ•™์Šต์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ์ธ๊ณต์ง€๋Šฅ ๋ชจ๋ธ ํ•™์Šต ์ž‘์—… ๊ด€๋ฆฌ ์‹œ์Šคํ…œ์ธ Spotify๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์šฐ๋ฆฌ์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” Spotify๊ฐ€ ์„ ์ ๊ฐ€๋Šฅํ˜• ํด๋ผ์šฐ๋“œ ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ์˜จ๋””๋งจ๋“œ ํด๋ผ์šฐ๋“œ ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ ๋Œ€๋น„ 2.86%์˜ ์ง€์—ฐ์‹œ๊ฐ„ ์˜ค๋ฒ„ํ—ค๋“œ๋งŒ์„ ํฌ์ƒํ•˜์—ฌ ์ตœ๋Œ€ 62%์— ๋‹ฌํ•˜๋Š” ๋น„์šฉ์„ ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค.Abstract 1 1 Introduction 5 2 Background 8 2.1 Preemptible Virtual Machines 8 2.2 Model Training and Checkpointing 9 3 Challenges 12 3.1 Unpredictability of Preemptions 12 3.2 Resource Management 14 4 Modeling Checkpointing Policy 15 4.1 Approximating Optimal Checkpointing Interval 15 4.2 Emergency Save 17 4.3 Insurance Save 18 4.4 Adaptive Checkpointing 19 5 System Design 22 5.1 System Architecture and Workflow 22 5.2 API Design 25 6 Evaluation 27 6.1 Environment 27 6.1.1 Cloud VM 27 6.1.2 Job Specification 28 6.2 Evaluation Tools 28 6.2.1 Preemption Injector 28 6.2.2 Training Simulator 29 6.3 Training Performance and Cost 30 6.3.1 Efficiency of EmergencySave 30 6.3.2 Efficiency of Insurance Save 32 6.4 Effect of Preemption Frequency 35 7 Conclusion 36 ์ดˆ๋ก 41์„
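The "Approximating Optimal Checkpointing Interval" chapter builds on the classical first-order analysis of checkpoint cost versus failure rate; a minimal sketch of Young's well-known approximation (the function name and parameters are illustrative, not Spotify's actual API):

```python
import math

def young_daly_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order approximation of the checkpoint interval that
    minimizes expected wasted time: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Example: a 30 s checkpoint on spot VMs preempted every ~2 h on average.
interval = young_daly_interval(checkpoint_cost=30.0, mtbf=7200.0)
print(f"checkpoint every ~{interval:.0f} s")  # → checkpoint every ~657 s
```

The chapter titles suggest Spotify layers emergency and insurance saves on top of such a periodic baseline rather than relying on the fixed interval alone.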

    Extending Scojo-PECT by migration based on application level checkpointing

    Get PDF
    In parallel computing, jobs have different runtimes and resource requirements. Because runtimes correlate with resources, scheduling these jobs is a packing problem in which utilization and total execution time vary. Resources can sit idle while jobs are preempted or blocked by resource conflicts, with no chance to make use of them, which wastes system resources to a considerable degree. Here we propose an approach that takes periodic checkpoints of running jobs and exploits the resulting opportunity for migration to optimize our scheduler during long-term scheduling. We improve our original Scojo-PECT preemptive scheduler, which previously had no checkpoint support. We evaluate the execution time gained minus the overhead of checkpointing/migration, and compare the result with the original execution time.
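The gained-time-minus-overhead comparison the abstract describes can be sketched as a simple decision rule (all names are hypothetical, not Scojo-PECT's actual interface):

```python
def migration_net_gain(remaining_on_current: float,
                       remaining_on_target: float,
                       checkpoint_cost: float,
                       migration_cost: float) -> float:
    """Net execution time gained (seconds) by checkpointing a job and
    migrating it to a less contended node; positive means it pays off."""
    gain = remaining_on_current - remaining_on_target
    return gain - (checkpoint_cost + migration_cost)

def should_migrate(*args: float) -> bool:
    """Migrate only when the saved time exceeds the combined overhead."""
    return migration_net_gain(*args) > 0.0

# Example: moving saves 600 s, at 40 s checkpoint + 90 s migration cost.
print(should_migrate(1800.0, 1200.0, 40.0, 90.0))  # → True (net gain 470 s)
```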

    Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing

    Full text link
    Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results depends on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of the workloads on which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead. In this paper we present Khaos, a new approach that utilizes the parallel processing capabilities of virtual cloud automation technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three successive phases that borrow from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented Khaos prototypically together with Apache Flink and demonstrate its usefulness experimentally.
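The three-phase loop described above can be sketched abstractly (a toy control loop with invented names; Khaos's real implementation drives Apache Flink and cloud automation, not this simplification):

```python
def tune_checkpoint_interval(measure_violations, interval: float,
                             target: float = 0.01,
                             step: float = 1.25) -> float:
    """One iteration of a Khaos-style loop (illustrative only): shorten
    the checkpoint interval when failure experiments reveal too many QoS
    violations, lengthen it when there is headroom."""
    violations = measure_violations(interval)  # phase 2: inject failures, observe
    if violations > target:                    # phase 3: adapt the configuration
        return interval / step                 # checkpoint more often
    return interval * step                     # relax to reduce overhead

# Phase 1 stand-in: a steady-state model where violations grow with interval.
model = lambda t: t / 1000.0
interval = 30.0
for _ in range(5):
    interval = tune_checkpoint_interval(model, interval)
print(round(interval, 4))  # → 9.8304 (interval tightened until QoS is met)
```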

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Get PDF
    In order to efficiently use future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency, but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be used directly out of the box. The library can be easily extended to add more data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of the failure detection and communication recovery mechanisms. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.
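CRAFT itself is a C++ library with its own checkpoint data types; purely to illustrate what application-level checkpoint/restart means, here is a language-agnostic sketch in Python (atomic write plus restart from the last saved state; none of these names come from CRAFT's API):

```python
import json
import os
import tempfile

# Illustrative checkpoint location, not a CRAFT convention.
CKPT = os.path.join(tempfile.gettempdir(), "demo_state.json")

def save_checkpoint(state: dict, path: str = CKPT) -> None:
    """Write the checkpoint atomically so that a crash mid-write can
    never replace a valid older checkpoint with a truncated one."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

def load_checkpoint(path: str = CKPT) -> dict:
    """Resume from the last checkpoint, or cold-start if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "result": 0.0}

state = load_checkpoint()
for i in range(state["iteration"], 10):
    state["result"] += i              # the application's real work
    state["iteration"] = i + 1
    if state["iteration"] % 5 == 0:   # checkpoint every 5 iterations
        save_checkpoint(state)
```

Restarting the script after a kill resumes from the last multiple-of-five iteration instead of from zero, which is the whole point of the technique.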

    Near-optimal scheduling and decision-making models for reactive and proactive fault tolerance mechanisms

    Get PDF
    As High Performance Computing (HPC) systems increase in size to fulfill computational power demand, the chance of failure occurrences dramatically increases, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failure occurrences on the running applications. However, the overhead of FT mechanisms increases proportionally to the HPC systems' size. Therefore, challenges arise in handling the expensive overhead of FT mechanisms while minimizing the large amount of lost computing time due to failure occurrences. In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint mechanism, by means of stochastic processes and calculus of variations. The obtained schedule minimizes the waste time caused by the checkpoint mechanism and failure occurrences. Generally, checkpoint/restart mechanisms periodically save application states and load the saved state upon failure occurrences. Furthermore, to handle various FT mechanisms, an adaptive decision-making model has been developed to determine the best FT strategy to invoke at each decision point. The best mechanism at each decision point is selected among the considered FT mechanisms to globally minimize the total waste time for an application execution by means of a dynamic programming approach. In addition, the model is adaptive to deal with changes in failure rate over time.
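Choosing the best mechanism per decision point to globally minimize total waste is a classic dynamic program; a minimal sketch under assumed per-interval waste values and a mechanism-switching cost (both invented for illustration, not the dissertation's model):

```python
def min_total_waste(num_points: int, waste: dict, switch_cost: float) -> float:
    """waste[m][t] is the expected waste if mechanism m runs in interval t;
    switching mechanisms between intervals adds switch_cost. The DP state
    is (interval, mechanism chosen for that interval)."""
    mechs = list(waste)
    prev = {m: float(waste[m][0]) for m in mechs}
    for t in range(1, num_points):
        cur = {}
        for m in mechs:
            cur[m] = waste[m][t] + min(
                prev[k] + (0.0 if k == m else switch_cost) for k in mechs)
        prev = cur
    return min(prev.values())

# Hypothetical waste profiles for two FT mechanisms over 3 decision points.
waste = {"checkpoint": [5, 5, 5], "migration": [1, 9, 1]}
print(min_total_waste(3, waste, switch_cost=2.0))  # → 11.0
```

Here the optimum (11.0) beats both all-checkpoint (15) and any switching plan once the switch cost is charged, which is exactly the kind of trade-off a greedy per-interval choice would miss.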

    Checkpointing algorithms and fault prediction

    Get PDF
    This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly to the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointing period. These results make it possible to analytically assess the key parameters that impact the performance of fault predictors at very large scale.
    Comment: Supported in part by ANR Rescue. Published in the Journal of Parallel and Distributed Computing.
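As a back-of-the-envelope illustration only (not the paper's actual derivation): a predictor with recall r leaves a fraction 1 − r of failures unpredicted, so computing the Young/Daly period against that reduced failure rate stretches it by 1/√(1 − r):

```python
import math

def young_daly(mu: float, C: float) -> float:
    """Classical first-order optimal period: platform MTBF mu,
    checkpoint cost C, both in seconds."""
    return math.sqrt(2.0 * mu * C)

def period_with_predictor(mu: float, C: float, recall: float) -> float:
    """Illustrative simplification, not the paper's exact result: if only
    the (1 - recall) unpredicted failures drive periodic checkpointing,
    the effective MTBF grows to mu / (1 - recall)."""
    return young_daly(mu / (1.0 - recall), C)

print(round(young_daly(86400, 60)))                   # → 3220 (seconds)
print(round(period_with_predictor(86400, 60, 0.75)))  # → 6440 (seconds)
```

With recall 0.75 the period doubles in this toy model; the paper's contribution is the rigorous version of this analysis, including when to trust an individual prediction given the predictor's precision.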