Search CORE

479 research outputs found

Learning Scheduling Algorithms for Data Processing Clusters

Author: Abadi Martín
Addanki Ravichandra
Dai Hanjun
Finn Chelsea
Ghodsi Ali
Gog Ionel
Grandl Robert
Greensmith Evan
Hindman Benjamin
Kingma Diederik P
Mao Hongzi
Mao Hongzi
Marcus Ryan
Mirhoseini Azalia
Mirhoseini Azalia
Pinto Lerrel
Schulman John
Spark Apache
Sutton S.
Weaver Lex
Zaharia Matei
Publication venue
Publication date: 21/08/2019
Field of study

Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly-efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load

arXiv.org e-Print Archive

Crossref

DSpace@MIT

Workload Schedulers - Genesis, Algorithms and Comparisons

Author: Getov Vladimir
Getov Vladimir
Sliwko L.
Sliwko L.
Publication venue: IJCSSE
Publication date: 01/01/2015
Field of study

In this article we provide brief descriptions of three classes of schedulers: Operating Systems Process Schedulers, Cluster Systems, Jobs Schedulers and Big Data Schedulers. We describe their evolution from early adoptions to modern implementations, considering both the use and features of algorithms. In summary, we discuss differences between all presented classes of schedulers and discuss their chronological development. In conclusion, we highlight similarities in the focus of scheduling strategies design, applicable to both local and distributed systems

WestminsterResearch

Topology-aware equipartitioning with coscheduling on multicore systems

Author: Juurlink Ben
Richling Jan
Schönherr Jan Hendrik
Publication venue
Publication date: 01/01/2013
Field of study

Over the last decade, multicore architectures have become omnipresent. Today, they are used in the whole product range from server systems to handheld computers. The deployed software still undergoes the slow transition from sequential to parallel. This transition, however, is gaining more and more momentum due to the increased availability of more sophisticated parallel programming environments, which replace the some-times crude results of ad-hoc parallelization. Combined with the ever increasing complexity of multicore architectures, this results in a scheduling problem that is different from what it has been, because features such as non-uniform memory access, shared caches, or simultaneous multithreading have to be considered. In this paper, we compare different ways of scheduling multiple parallel applications. Due to emerging parallel programming environments, we only consider malleable applications, i. e., applications where the parallelism degree can be changed on the fly. We propose a topology-aware scheduling scheme that combines equipartitioning and coscheduling. It does not suffer from the drawbacks of the individual concepts and also allows to run applications at different degrees of parallelisms without compromising fairness. We find that topology-awareness increases performance for all evaluated workloads. The combination with coscheduling is more sensitive towards the executed workloads. However, the gained versatility allows new use cases to be explored, which were not possible before

DepositOnce

Crossref

Saturn: An Optimized Data System for Large Model Deep Learning Workloads

Author: Kumar Arun
Nagrecha Kabir
Publication venue
Publication date: 03/09/2023
Field of study

Large language models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques & tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists, domain scientists, etc. who may lack the necessary systems knowhow. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and SchedulE. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as an MILP. We find that direct use of an MILP-solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice.Comment: Under submission at VLDB. Code available: https://github.com/knagrecha/saturn. 12 pages + 3 pages references + 2 pages appendi

arXiv.org e-Print Archive

Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

Author: Kambatla Karthik Shashank
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2016
Field of study

The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources - web applications, mobile phones, sensors and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks

Purdue E-Pubs