660 research outputs found
Dynamic Optimization Techniques for Resource-Efficient Execution of Distributed Machine Learning
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Byung-Gon Chun.
Machine Learning (ML) systems are widely used to extract insights from data. Ever-increasing dataset sizes and model complexity gave rise to many efforts towards efficient distributed machine learning systems. One popular approach to supporting large-scale data and complicated models is the parameter server (PS) approach. In this approach, a training job runs with distributed worker and server tasks, where workers iteratively compute gradients to update the global model parameters that are kept in servers.
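The worker/server split described above can be pictured with a minimal single-process sketch. The toy linear model, data shards, and function names below are our own illustration, not the dissertation's code:

```python
# Minimal sketch of the parameter-server (PS) pattern: workers compute
# gradients on their data partitions; the server task applies the
# aggregated gradient to the global model parameters.

def server_update(params, grads, lr=0.1):
    """Server task: apply an aggregated gradient to the global parameters."""
    return [p - lr * g for p, g in zip(params, grads)]

def worker_gradient(params, shard):
    """Worker task: gradient of squared error for a 1-D linear model y = w*x."""
    w = params[0]
    g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return [g]

# Data following y = 3*x, partitioned across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
params = [0.0]
for _ in range(100):                                       # iterative training
    grads = [worker_gradient(params, s) for s in shards]   # workers, in parallel
    mean = [sum(g) / len(grads) for g in zip(*grads)]      # aggregate on server
    params = server_update(params, mean, lr=0.01)
# params[0] converges to 3.0
```

In a real PS system the workers and servers are separate distributed tasks and parameters are sharded across multiple servers; the loop above only shows the data flow of one training iteration.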
To improve PS system performance, this dissertation proposes two solutions that automatically optimize resource efficiency and system performance. First, we propose a solution that optimizes the resource configuration and workload partitioning of distributed ML training on the PS system. To find the best configuration, we build an Optimizer based on a cost model that works with online metrics. To efficiently apply the Optimizer's decisions, we design our runtime to be elastic, performing reconfiguration in the background with minimal overhead.
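One way to picture a cost-model-driven configuration search is to score every candidate worker/server split of a fixed machine pool with a simple analytic cost model and keep the cheapest. The model form and constants below are invented for illustration; the dissertation's actual cost model is fit from online metrics:

```python
# Toy configuration search: computation time shrinks with more workers,
# parameter serving time shrinks with more servers, and the two roles
# compete for the same fixed pool of machines.

def best_split(machines, compute_cost=1000.0, comm_cost=250.0):
    """Return (cost, workers, servers) minimizing the modeled
    per-iteration time over all splits of the machine pool."""
    best = None
    for workers in range(1, machines):
        servers = machines - workers
        c = compute_cost / workers + comm_cost / servers
        if best is None or c < best[0]:
            best = (c, workers, servers)
    return best

cost, workers, servers = best_split(10)   # picks 7 workers, 3 servers here
```

With these constants the search trades one worker for one server until the marginal gain flips, illustrating why the best split depends on the measured compute/communication balance rather than a fixed rule.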
The second solution optimizes the scheduling of resources and tasks of multiple ML training jobs in a shared cluster. Specifically, we co-locate jobs with complementary resource use to increase resource utilization, while executing their tasks in fine-grained units to avoid resource contention. To alleviate memory pressure from co-located jobs, we enable dynamic spill/reload of data, which adaptively changes the ratio of data between disk and memory.
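The spill/reload idea can be sketched as a store whose in-memory share tracks a memory budget that the scheduler grows or shrinks. This is an illustrative sketch under our own simplifications (whole blocks, a dict standing in for disk), not the dissertation's implementation:

```python
# Adaptive spill/reload sketch: blocks past the memory budget are spilled
# to disk; when a co-located job releases memory and the budget grows,
# spilled blocks are reloaded.

class SpillableStore:
    def __init__(self, budget):
        self.budget = budget          # max number of blocks kept in memory
        self.memory = {}              # block id -> data (insertion-ordered)
        self.disk = {}                # stand-in for on-disk blocks

    def put(self, key, block):
        self.memory[key] = block
        self._enforce()

    def resize(self, budget):
        """Called when the scheduler grows/shrinks this job's memory share."""
        self.budget = budget
        self._enforce()
        self._reload()

    def _enforce(self):               # spill oldest blocks past the budget
        while len(self.memory) > self.budget:
            key = next(iter(self.memory))
            self.disk[key] = self.memory.pop(key)

    def _reload(self):                # pull blocks back when memory frees up
        while self.disk and len(self.memory) < self.budget:
            key, block = self.disk.popitem()
            self.memory[key] = block

store = SpillableStore(budget=2)
for i in range(4):
    store.put(i, [i] * 10)            # only 2 blocks stay in memory
store.resize(4)                       # pressure relieved: reload from disk
```

After the loop only blocks 2 and 3 remain in memory; the `resize(4)` call brings the two spilled blocks back, so all four end up memory-resident.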
We build a working system that implements our approaches. The above two solutions are implemented in the same system and share the runtime part, which can dynamically migrate jobs between machines and reallocate machine resources. We evaluate our system with popular ML applications to verify the effectiveness of our solutions.
Chapter 1. Introduction 1
1.1 Distributed Machine Learning on Parameter Servers 1
1.2 Automating System Configuration of Distributed Machine Learning 2
1.3 Scheduling of Multiple Distributed Machine Learning Jobs 3
1.4 Contributions 5
1.5 Dissertation Structure 6
Chapter 2. Background 7
Chapter 3. Automating System Configuration of Distributed Machine Learning 10
3.1 System Configuration Challenges 11
3.2 Finding Good System Configuration 13
3.2.1 Cost Model 13
3.2.2 Cost Formulation 15
3.2.3 Optimization 16
3.3 Cruise 18
3.3.1 Optimizer 19
3.3.2 Elastic Runtime 21
3.4 Evaluation 26
3.4.1 Experimental Setup 26
3.4.2 Finding Baselines with Grid Search 28
3.4.3 Optimization in the Homogeneous Environment 28
3.4.4 Utilizing Opportunistic Resources 30
3.4.5 Optimization in the Heterogeneous Environment 31
3.4.6 Reconfiguration Speed 32
3.5 Related Work 33
3.6 Summary 34
Chapter 4. A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs 36
4.1 Resource Under-utilization Problems in PS ML Training 37
4.2 Harmony Overview 42
4.3 Multiplexing ML Jobs 43
4.3.1 Fine-grained Execution with Subtasks 44
4.3.2 Dynamic Grouping of Jobs 45
4.3.3 Dynamic Data Reloading 54
4.4 Evaluation 56
4.4.1 Baselines 56
4.4.2 Experimental Setup 57
4.4.3 Performance Comparison 59
4.4.4 Performance Breakdown 59
4.4.5 Workload Sensitivity Analysis 61
4.4.6 Accuracy of the Performance Model 63
4.4.7 Performance and Scalability of the Scheduling Algorithm 64
4.4.8 Dynamic Data Reloading 66
4.5 Discussion 67
4.6 Related Work 67
4.7 Summary 70
Chapter 5. Conclusion 71
5.1 Summary 71
5.2 Future Work 71
5.2.1 Other Communication Architecture Support 71
5.2.2 Deep Learning & GPU Resource Support 72
Abstract (in Korean) 81
Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters
Cloud computing is becoming a fundamental facility of society today. Large-scale public and private cloud datacenters comprising millions of servers, operating as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, the modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts the operational and capital components of cost efficiency, but also becomes a scaling bottleneck due to the limits of electricity delivered by nearby utilities. It is critical and challenging to improve multi-resource efficiency for global datacenters.
Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads on shared clusters, including online web services, batch processing, machine learning, streaming computing, interactive queries, and graph computation. Most of these are long-running workloads that leverage long-lived containers to execute tasks.
We surveyed datacenter resource scheduling work over the last 15 years. Most previous works are designed to maximize cluster efficiency for short-lived tasks in batch processing systems like Hadoop. They are not suitable for modern long-running workloads of microservice, Spark, Flink, Pregel, Storm, or TensorFlow-like systems. It is urgent to develop new, effective scheduling and resource allocation approaches to improve efficiency in large-scale enterprise datacenters.
In this dissertation, we are the first to define and identify the problems, challenges, and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. These workloads rely on predictive scheduling techniques to perform reservation, auto-scaling, migration, or rescheduling, which drives us to pursue more intelligent scheduling techniques backed by adequate predictive knowledge. We specify what intelligent scheduling is, which abilities are necessary for it, and how to leverage it to transform NP-hard online scheduling problems into tractable offline scheduling problems.
We designed and implemented an intelligent cloud datacenter scheduler that automatically performs resource-to-performance modeling, predictive optimal reservation estimation, and QoS (interference)-aware predictive scheduling to maximize resource efficiency across multiple dimensions (CPU, memory, network, disk I/O), while strictly guaranteeing service level agreements (SLAs) for long-running workloads.
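The resource-to-performance modeling and reservation estimation steps can be illustrated with a deliberately simple model: fit latency ≈ a/cpus + b from profiling samples, then reserve the smallest allocation whose predicted latency meets the SLA. The model form, sample numbers, and function names are a hypothetical sketch, not the dissertation's scheduler:

```python
# Fit a toy resource-to-performance model (latency ~= a / cpus + b) by
# least squares in the transformed variable x = 1 / cpus, then pick the
# minimal reservation that satisfies a latency SLA.

def fit_inverse_model(samples):
    """Least-squares fit of latency = a / cpus + b from (cpus, latency) pairs."""
    xs = [1.0 / c for c, _ in samples]
    ys = [l for _, l in samples]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def min_reservation(a, b, sla_ms, max_cpus=64):
    """Smallest CPU count whose predicted latency meets the SLA, or None."""
    for cpus in range(1, max_cpus + 1):
        if a / cpus + b <= sla_ms:
            return cpus
    return None

profile = [(1, 105.0), (2, 55.0), (4, 30.0), (8, 17.5)]   # (cpus, latency ms)
a, b = fit_inverse_model(profile)
cpus = min_reservation(a, b, sla_ms=30.0)                  # 4 CPUs suffice here
```

A production scheduler would use richer models (multi-resource, interference-aware) and re-fit them online, but the shape of the decision, predict then reserve the minimum that meets the SLA, is the same.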
Finally, we introduced a large-scale co-location technique for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It effectively improves cluster utilization from 10% to an average of 50%. This goes far beyond scheduling alone, involving technique evolutions in IDC, networking, physical datacenter topology, storage, server hardware, operating systems, and containerization. We demonstrate its effectiveness through an analysis of the latest Alibaba public cluster trace, released in 2017. We are the first to reveal, through data, the global view of scenarios, challenges, and status in Alibaba's large-scale global datacenters, including big promotion events like Double 11.
Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary for pursuing maximized multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads.
Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments
This chapter presents the software architectures of big data processing platforms. It provides in-depth knowledge of the resource management techniques involved in deploying big data processing systems in a cloud environment. It starts from the very basics and gradually introduces the core components of resource management, which we have divided into multiple layers. It covers the state-of-the-art practices and research done in SLA-based resource management, with a specific focus on job scheduling mechanisms.
Comment: 27 pages, 9 figures