479 research outputs found
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly-efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load
Workload Schedulers - Genesis, Algorithms and Comparisons
In this article we provide brief descriptions of three classes of schedulers: Operating Systems Process Schedulers, Cluster Systems, Jobs Schedulers and Big Data Schedulers. We describe their evolution from early adoptions to modern implementations, considering both the use and features of algorithms. In summary, we discuss differences between all presented classes of schedulers and discuss their chronological development. In conclusion, we highlight similarities in the focus of scheduling strategies design, applicable to both local and distributed systems
Topology-aware equipartitioning with coscheduling on multicore systems
Over the last decade, multicore architectures have become omnipresent. Today, they are used in the whole product range from server systems to handheld computers. The deployed software still undergoes the slow transition from sequential to parallel. This transition, however, is gaining more and more momentum due to the increased availability of more sophisticated parallel programming environments, which replace the some-times crude results of ad-hoc parallelization. Combined with the ever increasing complexity of multicore architectures, this results in a scheduling problem that is different from what it has been, because features such as non-uniform memory access, shared caches, or simultaneous multithreading have to be considered. In this paper, we compare different ways of scheduling multiple parallel applications. Due to emerging parallel programming environments, we only consider malleable applications, i. e., applications where the parallelism degree can be changed on the fly. We propose a topology-aware scheduling scheme that combines equipartitioning and coscheduling. It does not suffer from the drawbacks of the individual concepts and also allows to run applications at different degrees of parallelisms without compromising fairness. We find that topology-awareness increases performance for all evaluated workloads. The combination with coscheduling is more sensitive towards the executed workloads. However, the gained versatility allows new use cases to be explored, which were not possible before
Saturn: An Optimized Data System for Large Model Deep Learning Workloads
Large language models such as GPT-3 & ChatGPT have transformed deep learning
(DL), powering applications that have captured the public's imagination. These
models are rapidly being adopted across domains for analytics on various
modalities, often by finetuning pre-trained base models. Such models need
multiple GPUs due to both their size and computational load, driving the
development of a bevy of "model parallelism" techniques & tools. Navigating
such parallelism choices, however, is a new burden for end users of DL such as
data scientists, domain scientists, etc. who may lack the necessary systems
knowhow. The need for model selection, which leads to many models to train due
to hyper-parameter tuning or layer-wise finetuning, compounds the situation
with two more burdens: resource apportioning and scheduling. In this work, we
tackle these three burdens for DL users in a unified manner by formalizing them
as a joint problem that we call SPASE: Select a Parallelism, Allocate
resources, and SchedulE. We propose a new information system architecture to
tackle the SPASE problem holistically, representing a key step toward enabling
wider adoption of large DL models. We devise an extensible template for
existing parallelism schemes and combine it with an automated empirical
profiler for runtime estimation. We then formulate SPASE as an MILP.
We find that direct use of an MILP-solver is significantly more effective
than several baseline heuristics. We optimize the system runtime further with
an introspective scheduling approach. We implement all these techniques into a
new data system we call Saturn. Experiments with benchmark DL workloads show
that Saturn achieves 39-49% lower model selection runtimes than typical current
DL practice.Comment: Under submission at VLDB. Code available:
https://github.com/knagrecha/saturn. 12 pages + 3 pages references + 2 pages
appendi
Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks
The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources - web applications, mobile phones, sensors and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks
- …