Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management
As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs - systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM
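The lifecycle the abstract describes (explicit state primitives, periodic checkpointing, partitioned scale-out, and recovery with tuple replay) can be sketched as follows. This is an illustrative Python sketch under assumed names, not the paper's actual API: a stateful word-count operator whose externalised state the runtime can checkpoint, partition across new instances, or restore on a fresh VM while replaying unprocessed tuples.

```python
import copy

class StatefulOperator:
    def __init__(self, state=None):
        self.state = state or {}   # externalised operator state
        self.unacked = []          # tuples not yet covered by a checkpoint

    def process(self, word):
        self.unacked.append(word)
        self.state[word] = self.state.get(word, 0) + 1

    # --- state management primitives exposed to the runtime ---
    def checkpoint(self):
        """Periodic checkpoint; the snapshot is backed up to an upstream VM."""
        self.unacked.clear()       # the checkpoint now covers all processed tuples
        return copy.deepcopy(self.state)

    def partition(self, n):
        """Scale-out: split checkpointed state across n new operator instances."""
        shards = [{} for _ in range(n)]
        for key, value in self.state.items():
            shards[hash(key) % n][key] = value
        return [StatefulOperator(s) for s in shards]

    @staticmethod
    def recover(last_checkpoint, replay_buffer):
        """Recovery: restore state on a new VM, then replay unprocessed tuples."""
        op = StatefulOperator(copy.deepcopy(last_checkpoint))
        for word in replay_buffer:
            op.process(word)
        return op

op = StatefulOperator()
for w in ["a", "b", "a"]:
    op.process(w)
ckpt = op.checkpoint()                             # backed up upstream
op.process("b")                                    # arrives after the checkpoint
recovered = StatefulOperator.recover(ckpt, ["b"])  # replay the in-flight tuple
```

Because recovery replays exactly the tuples not covered by the last checkpoint, the recovered instance converges to the same state as the failed one, which is what lets scale-out and failure recovery share one mechanism.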
Transiency-driven Resource Management for Cloud Computing Platforms
Modern distributed server applications are hosted on enterprise or cloud data centers that provide computing, storage, and networking capabilities to these applications. These applications are built on the implicit assumption that the underlying servers will be stable and normally available, barring occasional faults. In many emerging scenarios, however, data centers and clouds only provide transient, rather than continuous, availability of their servers. Transiency in modern distributed systems arises in many contexts, such as green data centers powered using renewable intermittent sources, and cloud platforms that provide lower-cost transient servers which can be unilaterally revoked by the cloud operator.
Transient computing resources are increasingly important, and existing fault-tolerance and resource management techniques are inadequate for transient servers because applications typically assume continuous resource availability. This thesis presents research in distributed systems design that treats transiency as a first-class design principle. I show that combining transiency-specific fault-tolerance mechanisms with resource management policies suited to application characteristics and requirements can yield significant cost and performance benefits. These mechanisms and policies have been implemented and prototyped as part of software systems, which allow a wide range of applications, such as interactive services and distributed data processing, to be deployed on transient servers, and can reduce cloud computing costs by up to 90%.
This thesis makes contributions to four areas of computer systems research: transiency-specific fault-tolerance, resource allocation, abstractions, and resource reclamation. For reducing the impact of transient server revocations, I develop two fault-tolerance techniques that are tailored to transient server characteristics and application requirements. For interactive applications, I build a derivative cloud platform that masks revocations by transparently moving application state between servers of different types. Similarly, for distributed data processing applications, I investigate the use of application-level periodic checkpointing to reduce the performance impact of server revocations. For managing and reducing the risk of server revocations, I investigate the use of server portfolios that allow transient resource allocation to be tailored to application requirements.
Finally, I investigate how resource providers (such as cloud platforms) can provide transient resource availability without revocation, by looking into alternative resource reclamation techniques. I develop resource deflation, wherein a server's resources are fractionally reclaimed, allowing the application to continue execution albeit with fewer resources. Resource deflation generalizes revocation, and the deflation mechanisms and cluster-wide policies can yield both high cluster utilization and low application performance degradation.
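The contrast between revocation and deflation can be sketched in a few lines. This is a hypothetical illustration (class and numbers invented for the example, not the thesis's implementation): instead of taking the whole server away, the provider reclaims a fraction of each resource and the application keeps running on what remains.

```python
class TransientServer:
    def __init__(self, cpus, mem_gb):
        self.cpus, self.mem_gb, self.alive = cpus, mem_gb, True

    def revoke(self):
        """Classic reclamation: the whole server disappears."""
        self.alive = False

    def deflate(self, fraction):
        """Deflation generalizes revocation: reclaim only a fraction of
        each resource (fraction=1.0 degenerates to full revocation)."""
        self.cpus = self.cpus * (1 - fraction)
        self.mem_gb = self.mem_gb * (1 - fraction)
        if self.cpus == 0:
            self.alive = False

server = TransientServer(cpus=8, mem_gb=32)
server.deflate(0.5)   # provider reclaims half of each resource
# the application continues running, albeit on 4 CPUs and 16 GB
```

The design point is visible even in this toy: deflation keeps the application alive (so no recovery cost is paid), while the provider still recovers capacity for other tenants.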
MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning
GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU workloads, including complicated AI/ML models, are not able to utilize the GPU resources to their fullest extent -- encouraging support for GPU multi-tenancy. We propose MISO, a technique to exploit the Multi-Instance GPU (MIG) capability on the latest NVIDIA datacenter GPUs (e.g., A100, H100) to dynamically partition GPU resources among co-located jobs. MISO's key insight is to use the lightweight, more flexible Multi-Process Service (MPS) capability to predict the best MIG partition allocation for different jobs, without incurring the overhead of applying those partitions during exploration. Due to its ability to utilize GPU resources more efficiently, MISO achieves 49% and 16% lower average job completion time than the unpartitioned and optimal static GPU partition schemes, respectively.
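MISO's key insight lends itself to a small sketch. This is an illustrative Python approximation of the idea only (the partition list and matching rule are simplified assumptions, not MISO's actual algorithm): observe each co-located job's GPU share under lightweight MPS sharing, then select the MIG partition whose slice sizes best match those observed demands, without physically reconfiguring MIG during exploration.

```python
# Candidate two-way MIG splits, as fractions of the GPU (simplified,
# loosely modeled on A100 compute-slice ratios; illustrative only).
MIG_PARTITIONS = [
    (4/7, 3/7),
    (3/7, 4/7),
    (1/2, 1/2),
]

def predict_best_partition(mps_utilisation):
    """mps_utilisation: per-job GPU share observed under MPS co-location.
    Pick the partition minimising total mismatch against those shares."""
    def mismatch(partition):
        return sum(abs(slice_frac - want)
                   for slice_frac, want in zip(partition, mps_utilisation))
    return min(MIG_PARTITIONS, key=mismatch)

# Suppose job A saturates ~60% of the GPU under MPS and job B ~40%:
best = predict_best_partition([0.6, 0.4])
```

The point of using MPS as the probe is that it shares the GPU in software and can be reconfigured cheaply, whereas each real MIG repartition would idle the instances being resized; prediction moves that cost off the critical path.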
Memory Subsystem Optimization for Efficient System Resource Utilization of Data-Intensive Applications
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8.
With explosive data growth, data-intensive applications, such as relational databases and key-value storage, have become increasingly popular in a variety of domains in recent years. To meet the growing performance demands of data-intensive applications, it is crucial to utilize memory resources efficiently and fully for the best possible performance.
However, general-purpose operating systems (OSs) are designed to provide system resources fairly, at the system level, to all applications running on a system. Because of this system-level fairness, a single application may find it difficult to extract the system's best performance. For performance reasons, many data-intensive applications therefore re-implement mechanisms that OSs already provide, on the assumption that they know their data better than the OS does. They can be greedily optimized for performance, but this may result in inefficient use of system resources.
In this dissertation, we claim that simple OS support, combined with minor application modifications, can yield even higher application performance without sacrificing system-level resource utilization. We optimize and extend the OS memory subsystem to better support applications, addressing three memory-related issues in data-intensive applications. First, we introduce a memory-efficient cooperative caching approach between the application and the kernel buffer to address the double-caching problem, where the same data resides in multiple layers. Second, we present a memory-efficient, transparent zero-copy read I/O scheme to avoid the performance-interference problem caused by memory copies during I/O. Third, we propose a memory-efficient fork-based checkpointing mechanism for in-memory database systems to mitigate the memory-footprint problem of the existing fork-based checkpointing scheme, whose memory usage grows incrementally (up to 2x) during checkpointing for update-intensive workloads.
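The third mechanism builds on the classic fork-based checkpointing pattern, which can be sketched minimally as follows (illustrative, POSIX-only Python; names are invented for the example, and this shows the baseline scheme the chapter improves, not the dissertation's optimized one). The child process sees a copy-on-write snapshot of memory at fork time and persists it, while the parent keeps serving updates; under update-heavy load, the CoW page duplication is exactly what inflates memory toward 2x.

```python
import json
import os
import tempfile

store = {"balance": 100}   # a stand-in for the in-memory database state

def checkpoint(path):
    pid = os.fork()
    if pid == 0:                       # child: sees the snapshot at fork time
        with open(path, "w") as f:     # persist the consistent snapshot
            json.dump(store, f)
        os._exit(0)
    return pid                         # parent: keeps serving immediately

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
pid = checkpoint(path)
store["balance"] = 0                   # update after fork: CoW duplicates the page
os.waitpid(pid, 0)
with open(path) as f:
    snapshot = json.load(f)
# snapshot reflects the pre-update state; the live store has moved on
```

Every page the parent dirties while the child is still writing gets physically duplicated by the kernel, which is why an update-intensive workload can briefly need up to twice the database's memory footprint during a checkpoint.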
To show the effectiveness of our approach, we implement and evaluate our schemes on real multi-core systems. The experimental results demonstrate that our cooperative approach addresses the above issues of data-intensive applications more effectively than existing non-cooperative approaches while delivering better performance (in terms of transaction processing speed, I/O throughput, and memory footprint).
Chapter 1 Introduction
1.1 Motivation
1.1.1 Importance of Memory Resources
1.1.2 Problems
1.2 Contributions
1.3 Outline
Chapter 2 Background
2.1 Linux Kernel Memory Management
2.1.1 Page Cache
2.1.2 Page Reclamation
2.1.3 Page Table and TLB Shootdown
2.1.4 Copy-on-Write
2.2 Linux Support for Applications
2.2.1 fork
2.2.2 madvise
2.2.3 Direct I/O
2.2.4 mmap
Chapter 3 Memory Efficient Cooperative Caching
3.1 Motivation
3.1.1 Problems of Existing Datastore Architecture
3.1.2 Proposed Architecture
3.2 Related Work
3.3 Design and Implementation
3.3.1 Overview
3.3.2 Kernel Support
3.3.3 Migration to DBIO
3.4 Evaluation
3.4.1 System Configuration
3.4.2 Methodology
3.4.3 TPC-C Benchmarks
3.4.4 YCSB Benchmarks
3.5 Summary
Chapter 4 Memory Efficient Zero-copy I/O
4.1 Motivation
4.1.1 The Problems of Copy-Based I/O
4.2 Related Work
4.2.1 Zero Copy I/O
4.2.2 TLB Shootdown
4.2.3 Copy-on-Write
4.3 Design and Implementation
4.3.1 Prerequisites for z-READ
4.3.2 Overview of z-READ
4.3.3 TLB Shootdown Optimization
4.3.4 Copy-on-Write Optimization
4.3.5 Implementation
4.4 Evaluation
4.4.1 System Configurations
4.4.2 Effectiveness of the TLB Shootdown Optimization
4.4.3 Effectiveness of CoW Optimization
4.4.4 Analysis of the Performance Improvement
4.4.5 Performance Interference Intensity
4.4.6 Effectiveness of z-READ in Macrobenchmarks
4.5 Summary
Chapter 5 Memory Efficient Fork-based Checkpointing
5.1 Motivation
5.1.1 Fork-based Checkpointing
5.1.2 Approach
5.2 Related Work
5.3 Design and Implementation
5.3.1 Overview
5.3.2 OS Support
5.3.3 Implementation
5.4 Evaluation
5.4.1 Experimental Setup
5.4.2 Performance
5.5 Summary
Chapter 6 Conclusion
Abstract (in Korean)
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have brought multi-core chips into their product lines, and these have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP), while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-threaded sections of code (single-threaded applications and serial sections of parallel applications) place important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl's law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task for regular applications, but it becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to benefit from multi-core platforms: Speculative Multithreading (SpMT) and non-speculative clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences but avoids a large degree of speculation. Despite the large amount of research on both these approaches, the techniques proposed so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed up sequential or lightly threaded applications on multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value-prediction technique, based on pre-computation slices (p-slices), to manage inter-thread dependences. Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP, and Memory Level Parallelism (MLP), thanks to its unique fine-grained thread-decomposition algorithm that adapts to the available parallelism in the application
…
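The p-slice idea behind Mitosis can be illustrated with a small sketch. This is a software analogy only (function names invented; Mitosis operates on instructions, not Python functions): a p-slice is a pruned copy of just the instructions that produce a speculative thread's live-in value, executed ahead of time so the thread can start early, with the prediction validated once the main thread catches up.

```python
def main_thread_prefix(data):
    # The real prefix does lots of work, of which only part feeds
    # the live-in value 'total' needed by the next thread.
    total = 0
    for x in data:
        total += x
    unrelated = [x * x for x in data]   # not needed to produce 'total'
    return total, unrelated

def p_slice(data):
    # The slice keeps only the instructions that compute the live-in,
    # so it finishes much earlier than the full prefix.
    total = 0
    for x in data:
        total += x
    return total

def speculative_thread(live_in):
    return live_in * 2                  # downstream work using the live-in

data = [1, 2, 3]
predicted = p_slice(data)               # cheap prediction of the live-in
spec_result = speculative_thread(predicted)  # thread starts speculatively
actual, _ = main_thread_prefix(data)
commit = (predicted == actual)          # validate: commit or squash the thread
```

When the slice is accurate (as here), the speculative thread's work commits and the two threads have effectively run in parallel; a misprediction squashes the thread, so the technique pays off only because p-slices are both cheap and usually correct.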