667 research outputs found

    Characterizing Deep-Learning I/O Workloads in TensorFlow

    Full text link
    The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. During training, a large number of relatively small files are first loaded and pre-processed on CPUs and then moved to the accelerator for computation. In addition, checkpoint and restart operations are carried out so that DL computing frameworks can restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our two benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator with the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. Using a burst buffer to checkpoint to fast, small-capacity storage and asynchronously copy the checkpoints to slower, large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to the slower storage on our benchmark environment.
    Comment: Accepted for publication at pdsw-DISCS 201
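    The paper's benchmark code is not reproduced in this listing; the fragment below is only a minimal sketch of the two mechanisms the abstract describes, using the public tf.data API: prefetch overlaps the CPU input pipeline with accelerator compute, and a burst-buffer-style helper checkpoints to a fast local tier before draining asynchronously to slower storage. The file layout, paths, and preprocess body are illustrative assumptions, not details from the paper.

```python
import shutil
import threading

import tensorflow as tf

def preprocess(path):
    # Hypothetical per-file load step; the paper's AlexNet mini-app
    # reads many small files and pre-processes them on the CPU.
    data = tf.io.read_file(path)
    return tf.io.decode_raw(data, tf.uint8)

# Input pipeline: parallel reads plus prefetch keep the CPU one or more
# batches ahead of the accelerator, hiding the effective cost of I/O.
files = tf.data.Dataset.list_files("train/*.bin")  # assumed, fixed-size records
ds = (files.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

def checkpoint_via_burst_buffer(ckpt: tf.train.Checkpoint, step: int):
    """Burst-buffer idea: write the checkpoint to a fast, small-capacity
    tier, then drain it to slow, large-capacity storage off the critical
    path."""
    fast_dir = f"/local_nvme/ckpt-{step}"   # fast tier (assumed mount)
    ckpt.write(f"{fast_dir}/ckpt")          # synchronous, but on fast storage
    threading.Thread(                       # asynchronous copy to slow tier
        target=shutil.copytree,
        args=(fast_dir, f"/slow_pfs/ckpt-{step}"),
        daemon=True,
    ).start()
```

    With prefetch(tf.data.AUTOTUNE) the runtime sizes the prefetch buffer itself, which is how the input pipeline can fully hide I/O once per-batch compute dominates.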

    Swap System Optimization through Memory Swap Pattern Analysis

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ์—ผํ—Œ์˜.The use of memory is one of the key parts of modern computer architecture (Von Neumann architecture) but when considering limited memory, it could be the most lethal part at the same time. Advances in hardware and software are making rapid strides in areas such as Big Data, HPC and machine learning and facing new turning points, while the use of memory increases along with those advances. In the server environment, various programs share resources which leads to a shortage of resources. Memory is one of those resources and needs to be managed. When the system is out of memory, the operating system evicts some of the pages out to storage and then loads the requested pages in memory. Given that the storage performance is slower than the memory, swap-induced delay is one of the critical issues in the overall performance degradation. Therefore, we designed and implemented a swpTracer to provide visualization to trace the swap in/out movement. To check the generality of the tool, we used mlock to optimize 429.mcf of Spec CPU 2006 based on the hint from swpTracer. The optimized program executes 2 to 3 times faster than the original program in a memory scarce environment. The scope of the performance improvement with previous system calls decreases when the memory limit increases. To sustain the improvement, we build a swap- prefetch to read ahead the swapped-out pages. The optimized application with swpTracer and swap-prefetch consistently exceeds the performance of the original code by 1.5x.๋ฉ”๋ชจ๋ฆฌ์˜ ์‚ฌ์šฉ์€ ํ˜„๋Œ€ ์ปดํ“จํ„ฐ ์•„ํ‚คํ…์ฒ˜(ํฐ ๋…ธ์ด๋งŒ ์•„ํ‚คํ…์ณ)์˜ ํ•ต์‹ฌ ๋ถ€๋ถ„ ์ค‘ ํ•˜ ๋‚˜์ด๊ธฐ ๋•Œ๋ฌธ์—, ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•œ ํ™˜๊ฒฝ์€ ์„ฑ๋Šฅ์— ์น˜๋ช…์ ์ธ๋‹ค. ํ•˜๋“œ์›จ์–ด์™€ ์†Œํ”„ํŠธ์›จ ์–ด์˜ ๋ฐœ์ „์œผ๋กœ ๋น…๋ฐ์ดํ„ฐ, HPC, ๋จธ์‹ ๋Ÿฌ๋‹๊ณผ ๊ฐ™์€ ๋ถ„์•ผ๋“ค์ด ๋น ๋ฅธ ์†๋„๋กœ ๋ฐœ์ „ํ•˜์—ฌ ๊ทธ์— ๋”ฐ๋ผ ๋ฉ”๋ชจ๋ฆฌ์˜ ์‚ฌ์šฉ๋Ÿ‰๋„ ์ฆ๊ฐ€ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ, ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ œํ•œ๋œ ์ž„๋ฒ ๋””๋“œ ํ™˜๊ฒฝ ์ด๋‚˜, ์—ฌ๋Ÿฌ ์ž‘์—…์ด ๋™์‹œ์— ์ˆ˜ํ–‰๋˜๋Š” ์„œ๋ฒ„์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ์œผ๋กœ ์ž‘์—…์ด ์ค‘๋‹จ๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. ์‹œ์Šคํ…œ์ด๋ฉ”๋ชจ๋ฆฌ๊ฐ€๋ถ€์กฑํ•˜๋ฉด์šด์˜์ฒด์ œ๋Š”์ผ๋ถ€ํŽ˜์ด์ง€๋ฅผ์ €์žฅ์†Œ๋กœ๋‚ด๋ณด๋‚ธ๋‹ค์Œ ์š”์ฒญ๋œ ํŽ˜์ด์ง€๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œํ•œ๋‹ค. ์Šคํ† ๋ฆฌ์ง€ ์„ฑ๋Šฅ์ด ๋ฉ”๋ชจ๋ฆฌ๋ณด๋‹ค ๋Š๋ฆฌ๋‹ค๋Š” ์ ์— ์„œ ์Šค์™‘์— ์˜ํ•œ ์ง€์—ฐ์€ ์ „๋ฐ˜์ ์ธ ์„ฑ๋Šฅ ์ €ํ•˜์˜ ์ค‘์š”ํ•œ ๋ฌธ์ œ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๋”ฐ๋ผ์„œ ์Šค์™‘์ด ํ”„๋กœ๊ทธ๋žจ ์ˆ˜ํ–‰ ์‹œ๊ฐ„์— ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š๋„๋ก ํ”„๋กœ๊ทธ๋žจ์˜ ์Šค์™‘ ๋ฐœ์ƒ ์ถ”์ด๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์Šค์™‘ ๋ฐœ์ƒ์„ ์ค„์ผ ์ˆ˜ ์žˆ๋„๋ก ํžŒํŠธ๋ฅผ ์ฃผ๋Š” ๋„๊ตฌ์ธ swpTracer๋ฅผ ์„ค๊ณ„, ์‹ค ํ–‰ํ–ˆ๋‹ค. mlock์„ ์‚ฌ์šฉํ•˜์—ฌ Spec CPU 2006 ๋ฒค์น˜๋งˆํฌ ์ค‘ 429.mcf์— ์ ์šฉํ–ˆ์„ ๋•Œ ๊ธฐ์กด ํ”„๋กœ๊ทธ๋žจ ๋Œ€๋น„ 2, 3 ๋ฐฐ ์„ฑ๋Šฅ์ด ๋นจ๋ผ์กŒ๋‹ค. ๊ธฐ์กด์˜ ์‹œ์Šคํ…œ ์ฝœ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ์ ํ™”ํ–ˆ์„ ๋•Œ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์‚ด์ง ๋ถ€์กฑํ•œ ๊ฒฝ์šฐ์—๋Š” ๋น„์Šทํ•œ์„ฑ๋Šฅ์„๋ณด์—ฌ์ฃผ์ง€๋งŒ, ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ 50% ๋ถ€์กฑํ•œ์ˆœ๊ฐ„๋ถ€ํ„ฐ์„ฑ๋Šฅํ–ฅ์ƒํญ์ด์ค„์—ˆ๋‹ค. ์ด๋ฅผ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด ์Šค์™‘ ์•„์›ƒ ๋˜์—ˆ๋˜ ํŽ˜์ด์ง€๋ฅผ ๋ฏธ๋ฆฌ ์ฝ์–ด๋‘๋Š” swap-prefetch๋ฅผ ๊ตฌํ˜„ํ–ˆ๋‹ค. ๋ฐฐ์—ด์„ 3๋ฒˆ ํšก๋‹จํ•˜๋Š” ํ”„๋กœ๊ทธ๋žจ์„ ๋Œ€์ƒ์œผ๋กœ ๋ฐฐ์—ด์˜ ํฌ๊ธฐ๋ฅผ ์กฐ์ ˆํ•˜๋ฉด์„œ swap-prefetch์˜ ์„ฑ๋Šฅ์„ ์‹œํ—˜ํ–ˆ๋‹ค. ์›๋ณธ ์ฝ”๋“œ์™€ ์‹œ์Šคํ…œ ํ•จ์ˆ˜์ธ madvise๋ฅผ ์‚ฌ์šฉ ํ–ˆ์„ ๋•Œ๋ณด๋‹ค ํ‰๊ท ์ ์œผ๋กœ 1.5 ์ข‹์•„์กŒ๋‹ค. 
๋˜, swap-prefetch๋ฅผ ๋‹ค๋ฅธ ์‹œ์Šคํ…œ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์™€ mlock๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ ํ‰๊ท  1.25๋ฐฐ ์„ฑ๋Šฅ์ด ๋นจ๋ผ์กŒ๋‹ค.Abstract Chapter 1 Introduction 1 Chapter 2 Background 4 2.1 Page Reclamation Policy . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Linux Swap Management . . . . . . . . . . . . . . . . . . . . . . 6 2.3 Linux System Calls . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Chapter 3 Design and Implementation 8 3.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.2.1 Kernel Level . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.2.2 Application Level . . . . . . . . . . . . . . . . . . . . . . . 10 Chapter 4 Evaluation 15 4.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.2.1 Generality of swpTracer . . . . . . . . . . . . . . . . . . . 16 4.2.2 Memory Optimization Method Comparison . . . . . . . . 17 Chapter 5 Related Work 20 Chapter 6 Conclusion 22 Bibliography ์ดˆ๋ก 28Maste
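    swpTracer and swap-prefetch are kernel-level mechanisms, so the thesis's actual code cannot be recovered from the abstract. As a user-space analogue of the read-ahead idea only, the following sketch maps a hypothetical data file and uses madvise(MADV_WILLNEED), which asks Linux for asynchronous read-ahead before a traversal touches the pages (requires Python 3.8+ on Linux):

```python
import mmap
import os

# Map a large file, then hint the kernel to read a region back in before
# the next traversal touches it. MADV_WILLNEED triggers asynchronous
# read-ahead, so by the time the loop reaches the pages they are resident.
fd = os.open("big_array.bin", os.O_RDONLY)        # hypothetical data file
size = os.fstat(fd).st_size
buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)

buf.madvise(mmap.MADV_WILLNEED, 0, size)          # prefetch hint
total = 0
for off in range(0, size, mmap.PAGESIZE):         # touch one byte per page
    total += buf[off]
os.close(fd)
```

    mlock, the other hint the thesis exploits, can be applied similarly to pin hot regions in memory so they are never swapped out in the first place.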

    Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

    Full text link
    Maximizing the level of parallelism in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to this problem. The compiler can detect data dependencies in an application and analyze specific sections of code for their parallelization potential. However, the techniques a compiler provides are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. One solution to this challenge is the use of runtime methods, implemented by delaying a certain amount of code analysis until runtime. In this research, we improve the performance of parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results were evaluated using an Airfoil application, which showed a 40-50% improvement in parallel performance.
    Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017)
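    HPX is a C++ runtime system and OP2 generates C/C++ and Fortran, so the following is only a language-neutral sketch, in Python, of two of the runtime techniques named above: asynchronous tasking over loop chunks, with the chunk size chosen at run time rather than fixed at compile time. The mesh data, chunking policy, and loop body are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def loop_body(cells, lo, hi):
    # Stand-in for one OP2 parallel-loop body over a chunk of the mesh.
    return sum(cells[lo:hi])

def parallel_loop(cells, pool, n_workers):
    # Dynamic chunk sizing: derive the chunk size from runtime state
    # (here simply problem size over an oversubscribed worker count)
    # instead of a compile-time constant, then submit the chunks as
    # asynchronous tasks that idle workers pick up as they free up.
    chunk = max(1, len(cells) // (n_workers * 4))
    futures = [pool.submit(loop_body, cells, lo, min(lo + chunk, len(cells)))
               for lo in range(0, len(cells), chunk)]
    return sum(f.result() for f in as_completed(futures))

if __name__ == "__main__":
    cells = list(range(1_000_000))       # hypothetical mesh values
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(parallel_loop(cells, pool, n_workers=8))
```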

    Computing server power modeling in a data center: survey, taxonomy and performance evaluation

    Full text link
    Data centers are large-scale, energy-hungry infrastructures serving the increasing computational demands of a world becoming ever more connected through smart cities. The emergence of advanced technologies such as cloud-based services, the Internet of Things (IoT), and big data analytics has fueled the growth of global data centers, leading to high energy consumption. This upsurge in data center energy consumption not only incurs surging operational and maintenance costs but also adversely affects the environment. Dynamic power management in a data center environment requires cognizance of the correlation between system- and hardware-level performance counters and power consumption. Power consumption modeling captures this correlation and is crucial for designing energy-efficient optimization strategies based on resource utilization. Several power models have been proposed and used in the literature. However, these models have been evaluated using different benchmarking applications, power measurement techniques, and error-calculation formulas on different machines. In this work, we present a taxonomy and evaluation of 24 software-based power models using a unified environment, benchmarking applications, power measurement technique, and error formula, with the aim of achieving an objective comparison. We use different server architectures to assess the impact of heterogeneity on the models' comparison. The performance analysis of these models is elaborated in the paper.
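    As a concrete example of the simplest class of software-based power model covered by such taxonomies, the sketch below fits a linear CPU-utilization model, P(u) = P_idle + (P_max - P_idle) * u, by least squares and scores it with the mean absolute percentage error; the utilization/power samples are made up for illustration:

```python
import numpy as np

# Made-up (CPU utilization, measured watts) samples from one server.
u = np.array([0.05, 0.20, 0.45, 0.70, 0.90])
p = np.array([62.0, 78.0, 103.0, 128.0, 149.0])

# Least-squares fit of P(u) = p_idle + slope * u.
A = np.vstack([np.ones_like(u), u]).T
(p_idle, slope), *_ = np.linalg.lstsq(A, p, rcond=None)

def predict(util):
    return p_idle + slope * util

# Mean absolute percentage error, one common error formula in such surveys.
mape = np.mean(np.abs(predict(u) - p) / p) * 100
print(f"P(u) = {p_idle:.1f} + {slope:.1f}*u  (MAPE = {mape:.2f}%)")
```

    Models higher up such a taxonomy typically add further performance counters (memory, disk, network) as regressors, which is why a unified error formula matters for comparing them.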

    ๊ฐ€์ƒํ™” ํ™˜๊ฒฝ์„ ์œ„ํ•œ ์›๊ฒฉ ๋ฉ”๋ชจ๋ฆฌ

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Advisor: Bernhard Egger.
    The rising importance of big data and artificial intelligence (AI) has led to an unprecedented shift of local computation into the cloud. One of the key drivers behind this transformation was the exploding cost of owning and maintaining large computing systems powerful enough to process these new workloads. Customers experience a reduced cost by renting only the required resources and only when needed, while data center operators benefit from efficiency at scale. A key factor in operating a profitable data center is a high overall utilization of its resources.
    Due to the scale of modern data centers, small improvements in efficiency translate to significant savings in the total cost of ownership (TCO). Many elements constitute an efficient data center, such as its location, architecture, cooling system, and the employed hardware. In this thesis, we focus on software-related aspects, namely the utilization of computational and memory resources. Reports from data centers operated by Alibaba and Google show that overall resource utilization has stagnated at around 50 to 60 percent over the past decade. This low average utilization is mostly attributable to peak-demand-driven resource allocation despite the high variability of modern workloads in their resource usage. In other words, data centers today lack an efficient way to put idle resources that are reserved but not used to work. In this dissertation, we present RackMem, a software-based solution that addresses the problem of low resource utilization through two main contributions. First, we introduce a disaggregated memory system tailored to virtual environments. We observe that virtual machines can use remote memory without noticeable performance degradation under moderate memory pressure on modern networking infrastructure. We implement a specialized remote paging system for QEMU/KVM that reduces the remote-paging tail latency by 98.2% in comparison to the state of the art. A job-processing simulation at rack scale shows that the total makespan can be reduced by 40.9% under our memory system. While seamless disaggregated memory helps to balance memory usage across nodes, individual nodes can still suffer from overloaded resources if co-located workloads exhibit high resource usage at the same time. As a second contribution, we present a novel live-migration technique for machines running on top of our remote paging system. With this instant live-migration technique, entire virtual machines can be migrated in as little as 100 milliseconds, reducing the effective service downtime by up to 92.6% compared to existing approaches. An evaluation with in-memory key-value database workloads shows that the presented migration technique improves the state of the art by a wide margin in all key performance metrics. The presented software-based solutions lay the technical foundations that allow data center operators to significantly improve the utilization of their computational and memory resources. As future work, we propose new job schedulers and load balancers to make full use of these new technical foundations.
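    RackMem itself is implemented as Linux kernel modules with an RDMA backend, none of which this listing reproduces. Purely as a toy model of the fast-path/slow-path split in its pagefault handling, the sketch below keeps a bounded local pool in front of a "remote" store and evicts least-recently-used pages to it; every name and the dict-backed remote store are hypothetical stand-ins:

```python
from collections import OrderedDict

class ToyRackPager:
    """Toy model of RackMem-style demand paging: a bounded local pool
    (fast path: page already resident) backed by a remote store
    (slow path: fetch over the network, evicting an LRU victim)."""

    def __init__(self, local_slots, remote_store):
        self.local = OrderedDict()          # page_id -> bytes, LRU order
        self.slots = local_slots
        self.remote = remote_store          # stand-in for the RDMA backend

    def access(self, page_id):
        if page_id in self.local:           # fast path: no network round trip
            self.local.move_to_end(page_id)
            return self.local[page_id]
        page = self.remote[page_id]         # slow path: remote fetch
        if len(self.local) >= self.slots:   # evict coldest page to remote
            victim, data = self.local.popitem(last=False)
            self.remote[victim] = data
        self.local[page_id] = page
        return page

pager = ToyRackPager(local_slots=2,
                     remote_store={i: bytes(4096) for i in range(8)})
for pid in [0, 1, 0, 5, 7]:                 # mixed fast- and slow-path hits
    pager.access(pid)
```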

    Improved Designs for Application Virtualization

    Get PDF
    We propose solutions for application virtualization that mitigate the performance loss in streaming and browser-based applications. For application streaming, we propose a solution that keeps operating system components and application software at the server and streams them to the client side for execution. This architecture minimizes the components managed at the clients and mitigates platform-level incompatibility. The runtime performance of application streaming is significantly reduced when the required code is not available on the client side in time. To mitigate this issue and boost runtime performance, we propose prefetching, i.e., speculatively delivering code blocks to the clients in advance. The probability model on which our prefetch method is based may be very large. To manage such a probability model and the associated hardware resources, we perform an information-gain analysis and derive two lower bounds on the information gain an attribute set must provide to achieve a given prefetch hit rate. We organize the probability model as a look-up table (LUT). Similar to the memory hierarchy widely used in computing, we separate the single LUT into two-level, hierarchical LUTs. To separate the entries without sorting them all, we propose a fast, entropy-based LUT separation algorithm that uses entropy as an indicator. Since the domain of an attribute can be much larger than the addressable space of a virtual memory system, we need an efficient way to allocate each LUT's entries in a limited memory address space. Instead of using expensive content-addressable memory (CAM), we use a hash function to convert attribute values into addresses, and we propose an improved version of Pearson hashing that reduces the collision rate with little extra complexity. Long interaction delays caused by network latency are a significant drawback of browser-based application virtualization. To address this, we propose a distributed infrastructure arrangement for browser-based application virtualization that reduces the average communication distance between servers and clients, and we investigate a hand-off protocol to handle user mobility. Analyses and simulations of information-based prefetching and of mobile applications are provided to quantify the benefits of the proposed solutions.
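    The abstract names Pearson hashing but does not describe the improved variant, so the sketch below shows only classic Pearson hashing, widened to a multi-byte output with the standard first-byte-rotation trick, as one way to map attribute values into a LUT address space. The permutation table seed, output width, and address-space size are assumptions:

```python
import random

random.seed(42)                # fixed, arbitrary permutation of 0..255
T = list(range(256))
random.shuffle(T)

def pearson_hash(key: bytes, width: int = 2) -> int:
    """Classic Pearson hashing: one table lookup per input byte. The
    output is widened to `width` bytes by re-hashing with a rotated
    first byte (Pearson's standard widening trick); the thesis's
    collision-reducing variant is not specified in the abstract."""
    out = 0
    for i in range(width):
        h = T[(key[0] + i) % 256]          # rotate the first byte per pass
        for b in key[1:]:
            h = T[h ^ b]                   # one table lookup per byte
        out = (out << 8) | h
    return out

addr = pearson_hash(b"attribute-value") % 4096   # map into LUT address space
```

    Because the hash needs only XORs and table lookups, it is cheap enough to replace a CAM while still spreading attribute values across the limited address space.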
    • โ€ฆ
    corecore