Dynamic Optimization Techniques for Resource-Efficient Execution of Distributed Machine Learning
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Byung-Gon Chun.
Machine Learning (ML) systems are widely used to extract insights from data. Ever-increasing dataset sizes and model complexity gave rise to many efforts towards efficient distributed machine learning systems. One of the popular approaches to support large-scale data and complicated models is the parameter server (PS) approach. In this approach, a training job runs with distributed worker and server tasks, where workers iteratively compute gradients to update the global model parameters that are kept in servers.
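The worker/server split described above can be sketched in a few lines. The following is a minimal single-process illustration of the generic PS pattern only; the class names, the toy least-squares gradient, and all constants are invented for the example, and this is not the system built in the dissertation.

```python
import numpy as np

class ParameterServer:
    """Holds the global model parameters and applies worker updates."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def push(self, gradient):
        # Apply a gradient computed by a worker.
        self.params -= self.lr * gradient

    def pull(self):
        # Workers fetch the current global parameters.
        return self.params.copy()

def worker_step(server, x, y):
    """One worker iteration: pull the parameters, compute a
    least-squares gradient on the local data shard, push it back."""
    w = server.pull()
    grad = 2 * x.T @ (x @ w - y) / len(y)
    server.push(grad)

# Toy run: two workers alternately training on their own shards of
# noiseless data generated from a known weight vector.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])
server = ParameterServer(dim=3)
for _ in range(500):
    worker_step(server, x1, x1 @ true_w)
    worker_step(server, x2, x2 @ true_w)
```

In a real PS deployment the pull/push calls are network RPCs and the workers run concurrently; the sequential loop here only shows the data flow.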
To improve PS system performance, this dissertation proposes two solutions that automatically optimize resource efficiency and system performance. First, we propose a solution that optimizes the resource configuration and workload partitioning of distributed ML training on a PS system. To find the best configuration, we build an Optimizer based on a cost model that works with online metrics. To efficiently apply the decisions made by the Optimizer, we design our runtime to be elastic, performing reconfiguration in the background with minimal overhead.
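As a toy illustration of cost-model-driven configuration search, one can enumerate worker/server splits of a fixed machine budget and pick the cheapest under a modeled per-iteration time. The cost formula, function names, and all parameter values below are invented for the sketch and are far simpler than the dissertation's actual cost model.

```python
def estimated_cost(num_workers, num_servers, comp_time,
                   bytes_per_update, bandwidth):
    """Toy cost model (invented for illustration): per-iteration time
    is computation divided across workers, plus parameter-exchange
    time, which grows with workers and shrinks with servers."""
    compute = comp_time / num_workers
    communicate = (bytes_per_update * num_workers) / (bandwidth * num_servers)
    return compute + communicate

def best_configuration(total_machines, comp_time, bytes_per_update, bandwidth):
    # Exhaustively split the machine budget between worker and
    # server roles and return the split with the lowest modeled cost.
    candidates = ((w, total_machines - w) for w in range(1, total_machines))
    return min(candidates,
               key=lambda c: estimated_cost(c[0], c[1], comp_time,
                                            bytes_per_update, bandwidth))

workers, servers = best_configuration(16, comp_time=120.0,
                                      bytes_per_update=1e9, bandwidth=1e9)
```

With online metrics, `comp_time` and `bytes_per_update` would be measured from the running job rather than fixed up front, and the optimizer would re-run as the measurements change.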
The second solution optimizes the scheduling of resources and tasks of multiple ML training jobs in a shared cluster. Specifically, we co-locate jobs with complementary resource use to increase resource utilization, while executing their tasks in fine-grained units to avoid resource contention. To alleviate memory pressure from co-located jobs, we enable dynamic spill/reload of data, which adaptively changes the ratio of data between disk and memory.
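The spill/reload idea can be illustrated with a toy store that keeps a bounded number of data partitions in memory, spills the least recently used ones to disk, and reloads them transparently on access. The class, its LRU policy, and the pickle-to-temp-directory storage are all invented for this sketch; the dissertation's system manages training data at much larger scale.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class SpillableStore:
    """Toy spill/reload store: keeps at most `max_in_memory`
    partitions in RAM, spilling the least recently used ones to disk
    and reloading them transparently on access."""
    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = OrderedDict()      # partition id -> data (LRU order)
        self.disk_dir = tempfile.mkdtemp()

    def _spill_if_needed(self):
        while len(self.memory) > self.max_in_memory:
            pid, data = self.memory.popitem(last=False)  # evict LRU entry
            with open(os.path.join(self.disk_dir, str(pid)), "wb") as f:
                pickle.dump(data, f)

    def put(self, pid, data):
        self.memory[pid] = data
        self._spill_if_needed()

    def get(self, pid):
        if pid not in self.memory:       # transparently reload from disk
            path = os.path.join(self.disk_dir, str(pid))
            with open(path, "rb") as f:
                self.memory[pid] = pickle.load(f)
            os.remove(path)
            self._spill_if_needed()
        self.memory.move_to_end(pid)     # mark as recently used
        return self.memory[pid]

    def set_memory_ratio(self, max_in_memory):
        # Adapt the memory/disk split, e.g. when a co-located job
        # suddenly needs more memory.
        self.max_in_memory = max_in_memory
        self._spill_if_needed()

store = SpillableStore(max_in_memory=2)
for pid in range(4):
    store.put(pid, list(range(pid, pid + 3)))
value = store.get(0)   # partition 0 was spilled; reloaded on access
```

The `set_memory_ratio` hook is where an adaptive policy would plug in, shrinking one job's in-memory share when its neighbor comes under pressure.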
We build a working system that implements our approaches. The above two solutions are implemented in the same system and share the runtime part, which can dynamically migrate jobs between machines and reallocate machine resources. We evaluate our system with popular ML applications to verify the effectiveness of our solutions.
Machine learning systems are widely used to extract hidden meaning from data. As dataset sizes and model complexity grow ever larger, many efforts have been made towards efficient distributed machine learning systems. The parameter server approach is one of the well-known methods for supporting large-scale data and complex models. In this approach, a training job consists of distributed workers and servers, and the workers iteratively compute gradients from their assigned input data to update the global model parameters kept in the servers.
To improve the performance of the parameter server system, this dissertation proposes two solutions that automatically optimize resource efficiency and system performance. The first solution automates the resource configuration and workload partitioning of distributed machine learning on parameter server systems. To find the best configuration, we built an Optimizer based on a cost model that uses online metrics. To apply the Optimizer's decisions efficiently, we designed the runtime to perform dynamic reconfiguration in the background with minimal overhead.
The second solution optimizes the scheduling of the subtasks and resources of multiple machine learning jobs in a shared cluster. Specifically, we suppress resource contention by executing subtasks in fine-grained units, and raise resource utilization by co-locating jobs with complementary resource-usage patterns on the same resources. To alleviate the memory pressure of co-located jobs, we support dynamically spilling data to disk and reloading it into memory, while the system automatically adapts the ratio of data between disk and memory to the situation.
To realize the above solutions, we built an actually working system. By implementing the two solutions in a single system, they share a runtime that can dynamically migrate tasks between machines and reallocate resources. To demonstrate the effectiveness of these solutions, we ran the system with popular machine learning applications and showed outstanding performance improvements over existing systems.
Chapter 1. Introduction
1.1 Distributed Machine Learning on Parameter Servers
1.2 Automating System Configuration of Distributed Machine Learning
1.3 Scheduling of Multiple Distributed Machine Learning Jobs
1.4 Contributions
1.5 Dissertation Structure
Chapter 2. Background
Chapter 3. Automating System Configuration of Distributed Machine Learning
3.1 System Configuration Challenges
3.2 Finding Good System Configuration
3.2.1 Cost Model
3.2.2 Cost Formulation
3.2.3 Optimization
3.3 Cruise
3.3.1 Optimizer
3.3.2 Elastic Runtime
3.4 Evaluation
3.4.1 Experimental Setup
3.4.2 Finding Baselines with Grid Search
3.4.3 Optimization in the Homogeneous Environment
3.4.4 Utilizing Opportunistic Resources
3.4.5 Optimization in the Heterogeneous Environment
3.4.6 Reconfiguration Speed
3.5 Related Work
3.6 Summary
Chapter 4. A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs
4.1 Resource Under-utilization Problems in PS ML Training
4.2 Harmony Overview
4.3 Multiplexing ML Jobs
4.3.1 Fine-grained Execution with Subtasks
4.3.2 Dynamic Grouping of Jobs
4.3.3 Dynamic Data Reloading
4.4 Evaluation
4.4.1 Baselines
4.4.2 Experimental Setup
4.4.3 Performance Comparison
4.4.4 Performance Breakdown
4.4.5 Workload Sensitivity Analysis
4.4.6 Accuracy of the Performance Model
4.4.7 Performance and Scalability of the Scheduling Algorithm
4.4.8 Dynamic Data Reloading
4.5 Discussion
4.6 Related Work
4.7 Summary
Chapter 5. Conclusion
5.1 Summary
5.2 Future Work
5.2.1 Other Communication Architecture Support
5.2.2 Deep Learning & GPU Resource Support
Abstract (in Korean)
CoCoA: A General Framework for Communication-Efficient Distributed Optimization
The scale of modern datasets necessitates the development of efficient
distributed optimization methods for machine learning. We present a
general-purpose framework for distributed computing environments, CoCoA, that
has an efficient communication scheme and is applicable to a wide variety of
problems in machine learning and signal processing. We extend the framework to
cover general non-strongly-convex regularizers, including L1-regularized
problems like lasso, sparse logistic regression, and elastic net
regularization, and show how earlier work can be derived as a special case. We
provide convergence guarantees for the class of convex regularized loss
minimization objectives, leveraging a novel approach in handling
non-strongly-convex regularizers and non-smooth loss functions. The resulting
framework has markedly improved performance over state-of-the-art methods, as
we illustrate with an extensive set of experiments on real distributed
datasets.
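The communication pattern at the heart of this framework, where each machine approximately solves a local subproblem over its data partition and a single aggregation step combines the results per outer round, can be sketched as below. This is a heavily simplified local-update-and-average illustration of that pattern on a least-squares problem, not CoCoA's actual dual subproblem formulation or its convergence-preserving aggregation; all names and constants are invented.

```python
import numpy as np

def local_solver(w, X, y, steps=10, lr=0.1):
    """Approximately solve the local least-squares subproblem,
    starting from the shared iterate w."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def outer_round(w, partitions):
    """One outer round: every machine improves w on its own data
    partition; then a single communication averages the iterates."""
    local_iterates = [local_solver(w, X, y) for X, y in partitions]
    return np.mean(local_iterates, axis=0)

# Four simulated machines, each holding a shard of noiseless data
# generated from a known weight vector.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
partitions = []
for _ in range(4):
    X = rng.normal(size=(16, 2))
    partitions.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(50):
    w = outer_round(w, partitions)
```

The point of the pattern is that many cheap local steps happen between communication rounds, so the number of network round trips, rather than the number of gradient steps, stays small.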
Deep Learning in the Automotive Industry: Applications and Tools
Deep Learning refers to a set of machine learning techniques that utilize
neural networks with many hidden layers for tasks such as image
classification, speech recognition, and language understanding. Deep learning has
been proven to be very effective in these domains and is pervasively used by
many Internet services. In this paper, we describe different automotive use
cases for deep learning, in particular in the domain of computer vision. We
survey the current state of the art in libraries, tools, and infrastructures
(e.g., GPUs and clouds) for implementing, training, and deploying deep neural
networks. We particularly focus on convolutional neural networks and computer
vision use cases, such as the visual inspection process in manufacturing plants
and the analysis of social media data. To train neural networks, curated and
labeled datasets are essential. In particular, both the availability and scope
of such datasets are typically very limited. A main contribution of this paper
is the creation of an automotive dataset that allows us to learn and
automatically recognize different vehicle properties. We describe an end-to-end
deep learning application utilizing a mobile app for data collection and
process support, and an Amazon-based cloud backend for storage and training.
For training we evaluate the use of cloud and on-premises infrastructures
(including multiple GPUs) in conjunction with different neural network
architectures and frameworks. We assess both the training times as well as the
accuracy of the classifier. Finally, we demonstrate the effectiveness of the
trained classifier in a real-world setting during the manufacturing process.
Comment: 10 pages
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (No. 6 on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review
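The Allreduce-based gradient aggregation these No-gRPC designs rely on can be illustrated with a pure-Python simulation of the classic ring algorithm (reduce-scatter followed by all-gather). This only sketches the communication structure that MPI and NCCL implement natively; all "ranks" live in one process, and the function is invented for this illustration.

```python
def ring_allreduce(rank_buffers):
    """Simulate a ring Allreduce over P in-process 'ranks'.

    Each buffer is split into P chunks. A reduce-scatter phase (P-1
    steps) leaves every rank with one fully summed chunk; an
    all-gather phase (P-1 steps) circulates the completed chunks, so
    every rank ends with the element-wise sum of all buffers."""
    p = len(rank_buffers)
    n = len(rank_buffers[0])
    assert n % p == 0, "toy version: buffer length must divide evenly"
    chunk = n // p
    bufs = [list(b) for b in rank_buffers]

    def sl(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: in step s, rank r sends chunk (r - s) mod p to
    # its ring neighbor, which accumulates it element-wise.
    for s in range(p - 1):
        for r in range(p):
            dst, c = (r + 1) % p, (r - s) % p
            for i in range(c * chunk, (c + 1) * chunk):
                bufs[dst][i] += bufs[r][i]

    # All-gather: rank r now owns the completed chunk (r + 1) mod p;
    # forward completed chunks around the ring.
    for s in range(p - 1):
        for r in range(p):
            dst, c = (r + 1) % p, (r + 1 - s) % p
            bufs[dst][sl(c)] = bufs[r][sl(c)]

    return bufs

# Four simulated ranks, each holding a "gradient" [r, r, ..., r].
result = ring_allreduce([[float(r)] * 8 for r in range(4)])
```

Each rank sends only 2*(P-1) chunks regardless of the number of ranks, which is why ring Allreduce is bandwidth-optimal for large gradients.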
Comparative Analysis of Open Source Frameworks for Machine Learning with Use Case in Single-Threaded and Multi-Threaded Modes
The basic features of some of the most versatile and popular open source
frameworks for machine learning (TensorFlow, Deep Learning4j, and H2O) are
considered and compared. Their comparative analysis was performed and
conclusions were made as to the advantages and disadvantages of these
platforms. The performance tests for the de facto standard MNIST data set were
carried out on H2O framework for deep learning algorithms designed for CPU and
GPU platforms for single-threaded and multithreaded modes of operation.
Comment: 4 pages, 6 figures, 4 tables; XIIth International Scientific and
Technical Conference on Computer Sciences and Information Technologies (CSIT
2017), Lviv, Ukraine
Machine Learning in Wireless Sensor Networks: Algorithms, Strategies, and Applications
Wireless sensor networks monitor dynamic environments that change rapidly
over time. This dynamic behavior is either caused by external factors or
initiated by the system designers themselves. To adapt to such conditions,
sensor networks often adopt machine learning techniques to eliminate the need
for unnecessary redesign. Machine learning also inspires many practical
solutions that maximize resource utilization and prolong the lifespan of the
network. In this paper, we present an extensive literature review over the
period 2002-2013 of machine learning methods that were used to address common
issues in wireless sensor networks (WSNs). The advantages and disadvantages of
each proposed algorithm are evaluated against the corresponding problem. We
also provide a comparative guide to aid WSN designers in developing suitable
machine learning solutions for their specific application challenges.
Comment: Accepted for publication in IEEE Communications Surveys and Tutorials
Performance Analysis of Open Source Machine Learning Frameworks for Various Parameters in Single-Threaded and Multi-Threaded Modes
The basic features of some of the most versatile and popular open source
frameworks for machine learning (TensorFlow, Deep Learning4j, and H2O) are
considered and compared. Their comparative analysis was performed and
conclusions were made as to the advantages and disadvantages of these
platforms. The performance tests for the de facto standard MNIST data set were
carried out on H2O framework for deep learning algorithms designed for CPU and
GPU platforms for single-threaded and multithreaded modes of operation. Also, we
present the results of testing neural networks architectures on H2O platform
for various activation functions, stopping metrics, and other parameters of
machine learning algorithm. It was demonstrated for the use case of MNIST
database of handwritten digits in single-threaded mode that blind selection of
these parameters can hugely increase the runtime (by 2-3 orders of magnitude)
without a significant increase in precision. This result can have a crucial influence on
the optimization of available and new machine learning methods, especially for
image recognition problems.
Comment: 15 pages, 11 figures, 4 tables; this paper summarizes the activities
which were started recently and described briefly in the previous conference
presentations arXiv:1706.02248 and arXiv:1707.04940; it is accepted for
Springer book series "Advances in Intelligent Systems and Computing"