422 research outputs found

    Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management

    As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on these primitives, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM
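
    The checkpoint/partition/restore cycle described above can be pictured as a small interface over keyed operator state. The Python fragment below is a minimal sketch of that idea; all names (OperatorState, checkpoint, partition, restore) and the tuple layout are illustrative assumptions, not the paper's actual API:

    # Minimal sketch of explicit operator-state primitives as described in the
    # abstract; class and function names are hypothetical illustrations.
    from dataclasses import dataclass, field

    @dataclass
    class OperatorState:
        # Keyed state lets a checkpoint be split by key when scaling out.
        kv: dict = field(default_factory=dict)
        processed_upto: int = 0      # last tuple sequence number folded into state

    def checkpoint(state: OperatorState) -> dict:
        """Externalise state so the SPS can back it up to an upstream VM."""
        return {"kv": dict(state.kv), "processed_upto": state.processed_upto}

    def partition(ckpt: dict, n: int) -> list[dict]:
        """Split a checkpoint into n partitions by hashing keys (scale-out)."""
        parts = [{"kv": {}, "processed_upto": ckpt["processed_upto"]} for _ in range(n)]
        for k, v in ckpt["kv"].items():
            parts[hash(k) % n]["kv"][k] = v
        return parts

    def restore(ckpt: dict, unprocessed: list) -> OperatorState:
        """Recover a failed operator: load its checkpoint, then replay tuples."""
        state = OperatorState(kv=dict(ckpt["kv"]), processed_upto=ckpt["processed_upto"])
        for seq, key, value in unprocessed:
            if seq > state.processed_upto:   # skip tuples already in the checkpoint
                state.kv[key] = value
                state.processed_upto = seq
        return state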

    MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning

    GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU workloads, including complicated AI/ML models, are not able to utilize the GPU resources to their fullest extent -- encouraging support for GPU multi-tenancy. We propose MISO, a technique to exploit the Multi-Instance GPU (MIG) capability on the latest NVIDIA datacenter GPUs (e.g., A100, H100) to dynamically partition GPU resources among co-located jobs. MISO's key insight is to use the lightweight, more flexible Multi-Process Service (MPS) capability to predict the best MIG partition allocation for different jobs, without incurring the overhead of implementing those partitions during exploration. Owing to its more efficient use of GPU resources, MISO achieves 49% and 16% lower average job completion time than the unpartitioned and optimal static GPU partition schemes, respectively.
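
    The two-phase idea (profile cheaply under MPS, then commit to a MIG split) can be sketched as follows. The demand heuristic and the candidate list here are placeholder assumptions for illustration only, not MISO's published algorithm or any NVIDIA API; real MIG profiles constrain which splits are valid:

    # Illustrative sketch: infer each job's GPU demand from MPS co-location
    # measurements, then pick the 7-slice MIG split closest to that demand.
    CANDIDATE_SPLITS = [(4, 3), (3, 4), (5, 2), (2, 5), (6, 1), (1, 6)]  # slices per job

    def predict_best_split(solo: dict[str, float], shared: dict[str, float]) -> tuple[int, int]:
        """solo/shared: throughput of each of two jobs alone vs. under MPS."""
        jobs = sorted(solo)
        # Placeholder demand score: how much a job slows down when sharing;
        # the job that suffers more is assumed to need more dedicated slices.
        demand = {j: max(1e-6, 1.0 - shared[j] / solo[j]) for j in jobs}
        want = demand[jobs[0]] / (demand[jobs[0]] + demand[jobs[1]]) * 7
        return min(CANDIDATE_SPLITS, key=lambda s: abs(s[0] - want))

    # Job "a" degrades badly under MPS, job "b" barely at all:
    print(predict_best_split({"a": 100, "b": 100}, {"a": 55, "b": 90}))  # -> (6, 1)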

    ๋ฐ์ดํ„ฐ ์ง‘์•ฝ์  ์‘์šฉ์˜ ํšจ์œจ์ ์ธ ์‹œ์Šคํ…œ ์ž์› ํ™œ์šฉ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์„œ๋ธŒ์‹œ์Šคํ…œ ์ตœ์ ํ™”

    Ph.D. dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2020 (advisor: Heon Y. Yeom). With explosive data growth, data-intensive applications, such as relational databases and key-value storage, have become increasingly popular in a variety of domains in recent years. To meet the growing performance demands of data-intensive applications, it is crucial to utilize memory resources efficiently and fully for the best possible performance. However, general-purpose operating systems (OSs) are designed to provide system resources fairly, at system level, to all applications running on a system. A single application may find it difficult to fully exploit the system's best performance because of this system-level fairness. For performance reasons, many data-intensive applications therefore reimplement mechanisms that OSs already provide, on the assumption that they know their data better than the OS does. Such implementations can be greedily optimized for performance, but this may result in inefficient use of system resources. In this dissertation, we claim that simple OS support combined with minor application modifications can yield even higher application performance without sacrificing system-level resource utilization. We optimize and extend the OS memory subsystem to better support applications, addressing three memory-related issues in data-intensive applications. First, we introduce a memory-efficient cooperative caching approach between application and kernel buffers to address the double caching problem, where the same data resides in multiple layers. Second, we present a memory-efficient, transparent zero-copy read I/O scheme that avoids the performance interference caused by memory copies during I/O. Third, we propose a memory-efficient fork-based checkpointing mechanism for in-memory database systems that mitigates the memory footprint problem of the existing fork-based checkpointing scheme, whose memory usage grows incrementally (up to 2x) during checkpointing under update-intensive workloads. To show the effectiveness of our approach, we implement and evaluate our schemes on real multi-core systems. The experimental results demonstrate that our cooperative approach addresses the above issues more effectively than existing non-cooperative approaches while delivering better performance (in terms of transaction processing speed, I/O throughput, and memory footprint).
    Contents:
      Chapter 1 Introduction
        1.1 Motivation
          1.1.1 Importance of Memory Resources
          1.1.2 Problems
        1.2 Contributions
        1.3 Outline
      Chapter 2 Background
        2.1 Linux Kernel Memory Management
          2.1.1 Page Cache
          2.1.2 Page Reclamation
          2.1.3 Page Table and TLB Shootdown
          2.1.4 Copy-on-Write
        2.2 Linux Support for Applications
          2.2.1 fork
          2.2.2 madvise
          2.2.3 Direct I/O
          2.2.4 mmap
      Chapter 3 Memory Efficient Cooperative Caching
        3.1 Motivation
          3.1.1 Problems of Existing Datastore Architecture
          3.1.2 Proposed Architecture
        3.2 Related Work
        3.3 Design and Implementation
          3.3.1 Overview
          3.3.2 Kernel Support
          3.3.3 Migration to DBIO
        3.4 Evaluation
          3.4.1 System Configuration
          3.4.2 Methodology
          3.4.3 TPC-C Benchmarks
          3.4.4 YCSB Benchmarks
        3.5 Summary
      Chapter 4 Memory Efficient Zero-copy I/O
        4.1 Motivation
          4.1.1 The Problems of Copy-Based I/O
        4.2 Related Work
          4.2.1 Zero Copy I/O
          4.2.2 TLB Shootdown
          4.2.3 Copy-on-Write
        4.3 Design and Implementation
          4.3.1 Prerequisites for z-READ
          4.3.2 Overview of z-READ
          4.3.3 TLB Shootdown Optimization
          4.3.4 Copy-on-Write Optimization
          4.3.5 Implementation
        4.4 Evaluation
          4.4.1 System Configurations
          4.4.2 Effectiveness of the TLB Shootdown Optimization
          4.4.3 Effectiveness of CoW Optimization
          4.4.4 Analysis of the Performance Improvement
          4.4.5 Performance Interference Intensity
          4.4.6 Effectiveness of z-READ in Macrobenchmarks
        4.5 Summary
      Chapter 5 Memory Efficient Fork-based Checkpointing
        5.1 Motivation
          5.1.1 Fork-based Checkpointing
          5.1.2 Approach
        5.2 Related Work
        5.3 Design and Implementation
          5.3.1 Overview
          5.3.2 OS Support
          5.3.3 Implementation
        5.4 Evaluation
          5.4.1 Experimental Setup
          5.4.2 Performance
        5.5 Summary
      Chapter 6 Conclusion
      Abstract (in Korean)
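
    Chapter 5's starting point, the classic fork-based snapshot, explains the up-to-2x footprint problem directly: every page the parent dirties while the child is dumping gets physically duplicated by copy-on-write. A minimal Unix-only Python sketch of that baseline (not the dissertation's optimized mechanism):

    # Baseline fork-based checkpoint: fork() gives the child a copy-on-write
    # snapshot of the in-memory database, so the parent keeps serving updates
    # while the child serialises a consistent image. Unix-only sketch.
    import json
    import os

    def checkpoint(db: dict, path: str) -> None:
        pid = os.fork()
        if pid == 0:                  # child: sees a frozen CoW snapshot of db
            with open(path, "w") as f:
                json.dump(db, f)
            os._exit(0)               # exit without running parent-side cleanup
        # parent: returns immediately; the kernel duplicates only the pages
        # that are actually written while the child is still dumping.

    db = {f"key{i}": i for i in range(100_000)}
    checkpoint(db, "/tmp/snapshot.json")
    db["key0"] = -1                   # this write triggers a CoW page copy
    os.wait()                         # reap the checkpointer child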

    Mitosis-based speculative multithreaded architectures

    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have brought multi-core chips into their product lines, and these chips have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP), while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-threaded sections of code (single-threaded applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl's law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be straightforward for regular applications, but it becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, the research community has followed two main directions to benefit from multi-core platforms: Speculative Multithreading (SpMT) and non-speculative clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences while avoiding a large degree of speculation. Despite the large amount of research on both approaches, the techniques proposed so far have shown only marginal performance improvements. In this thesis we propose novel schemes to speed up sequential or lightly threaded applications on multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose an SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP, and Memory Level Parallelism (MLP), thanks to its unique fine-grain thread decomposition algorithm that adapts to the available parallelism in the application.
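
    The p-slice idea can be illustrated with a toy sketch: a distilled fragment of the program predicts a speculative thread's live-in value so the thread can start early, and the speculation is validated (and squashed on a mispredict) once the real value is known. Mitosis realises this in the compiler and hardware; the Python below, with invented function names, only mirrors the control flow, and its slice happens to be exact, whereas real p-slices may mispredict:

    # Toy pre-computation slice (p-slice) illustration: predict a live-in,
    # start the dependent stage speculatively, validate, squash on mismatch.
    from concurrent.futures import ThreadPoolExecutor

    def heavy_stage(x: int) -> int:
        return sum(i * x for i in range(100_000))   # stand-in for real work

    def p_slice(n: int) -> int:
        # Distilled copy of just the instructions producing the live-in:
        # the loop `acc += i` below collapses to a closed form.
        return n * (n - 1) // 2

    def run(n: int) -> int:
        with ThreadPoolExecutor() as pool:
            predicted = p_slice(n)
            future = pool.submit(heavy_stage, predicted)  # speculative thread
            acc = 0
            for i in range(n):                            # non-speculative thread
                acc += i
            if acc == predicted:                          # validate live-in
                return future.result()                    # commit speculation
            return heavy_stage(acc)                       # squash + re-execute

    print(run(10_000))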