Search CORE

49,285 research outputs found

Online Job Scheduling in Distributed Machine Learning Clusters

Author: Bao Yixin
Li Zongpeng
Peng Yanghua
Wu Chuan
Publication venue
Publication date: 03/01/2018
Field of study

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural network, multiple workers are run in parallel to train partitions of the input dataset, and update shared model parameters. In a shared cluster handling multiple training jobs, a fundamental issue is how to efficiently schedule jobs and set the number of concurrent workers to run for each job, such that server resources are maximally utilized and model training can be completed in time. Targeting a distributed machine learning system using the parameter server framework, we design an online algorithm for scheduling the arriving jobs and deciding the adjusted numbers of concurrent workers and parameter servers for each job over its course, to maximize overall utility of all jobs, contingent on their completion times. Our online algorithm design utilizes a primal-dual framework coupled with efficient dual subroutines, achieving good long-term performance guarantees with polynomial time complexity. Practical effectiveness of the online algorithm is evaluated using trace-driven simulation and testbed experiments, which demonstrate its outperformance as compared to commonly adopted scheduling algorithms in today's cloud systems

arXiv.org e-Print Archive

Crossref

The Online Knapsack Problem with Departures

Author: Hajiesmaili Mohammad
Lui John C. S.
Sun Bo
Towsley Don
Tsang Danny H. K.
Wierman Adam
Yang Lin
Publication venue
Publication date: 04/11/2022
Field of study

The online knapsack problem is a classic online resource allocation problem in networking and operations research. Its basic version studies how to pack online arriving items of different sizes and values into a capacity-limited knapsack. In this paper, we study a general version that includes item departures, while also considering multiple knapsacks and multi-dimensional item sizes. We design a threshold-based online algorithm and prove that the algorithm can achieve order-optimal competitive ratios. Beyond worst-case performance guarantees, we also aim to achieve near-optimal average performance under typical instances. Towards this goal, we propose a data-driven online algorithm that learns within a policy-class that guarantees a worst-case performance bound. In trace-driven experiments, we show that our data-driven algorithm outperforms other benchmark algorithms in an application of online knapsack to job scheduling for cloud computing

arXiv.org e-Print Archive

Truth and Regret in Online Scheduling

Author: Chawla Shuchi
Devanur Nikhil
Kulkarni Janardhan
Niazadeh Rad
Publication venue
Publication date: 01/03/2017
Field of study

We consider a scheduling problem where a cloud service provider has multiple units of a resource available over time. Selfish clients submit jobs, each with an arrival time, deadline, length, and value. The service provider's goal is to implement a truthful online mechanism for scheduling jobs so as to maximize the social welfare of the schedule. Recent work shows that under a stochastic assumption on job arrivals, there is a single-parameter family of mechanisms that achieves near-optimal social welfare. We show that given any such family of near-optimal online mechanisms, there exists an online mechanism that in the worst case performs nearly as well as the best of the given mechanisms. Our mechanism is truthful whenever the mechanisms in the given family are truthful and prompt, and achieves optimal (within constant factors) regret. We model the problem of competing against a family of online scheduling mechanisms as one of learning from expert advice. A primary challenge is that any scheduling decisions we make affect not only the payoff at the current step, but also the resource availability and payoffs in future steps. Furthermore, switching from one algorithm (a.k.a. expert) to another in an online fashion is challenging both because it requires synchronization with the state of the latter algorithm as well as because it affects the incentive structure of the algorithms. We further show how to adapt our algorithm to a non-clairvoyant setting where job lengths are unknown until jobs are run to completion. Once again, in this setting, we obtain truthfulness along with asymptotically optimal regret (within poly-logarithmic factors)

arXiv.org e-Print Archive

Crossref

Learning Scheduling Algorithms for Data Processing Clusters

Author: Abadi Martín
Addanki Ravichandra
Dai Hanjun
Finn Chelsea
Ghodsi Ali
Gog Ionel
Grandl Robert
Greensmith Evan
Hindman Benjamin
Kingma Diederik P
Mao Hongzi
Mao Hongzi
Marcus Ryan
Mirhoseini Azalia
Mirhoseini Azalia
Pinto Lerrel
Schulman John
Spark Apache
Sutton S.
Weaver Lex
Zaharia Matei
Publication venue
Publication date: 21/08/2019
Field of study

Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly-efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load

arXiv.org e-Print Archive

Crossref

DSpace@MIT

Truthful Online Scheduling with Commitments

Author: Azar Yossi
Joseph
Kalp-Shaltiel Inna
Lucier Brendan
Menache Ishai
Naor
Yaniv Jonathan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/07/2015
Field of study

We study online mechanisms for preemptive scheduling with deadlines, with the goal of maximizing the total value of completed jobs. This problem is fundamental to deadline-aware cloud scheduling, but there are strong lower bounds even for the algorithmic problem without incentive constraints. However, these lower bounds can be circumvented under the natural assumption of deadline slackness, i.e., that there is a guaranteed lower bound

s > 1

on the ratio between a job's size and the time window in which it can be executed. In this paper, we construct a truthful scheduling mechanism with a constant competitive ratio, given slackness

s > 1

. Furthermore, we show that if

s

is large enough then we can construct a mechanism that also satisfies a commitment property: it can be determined whether or not a job will finish, and the requisite payment if so, well in advance of each job's deadline. This is notable because, in practice, users with strict deadlines may find it unacceptable to discover only very close to their deadline that their job has been rejected

arXiv.org e-Print Archive

Crossref

Reservation-Based Federated Scheduling for Parallel Real-Time Tasks

Author: Agrawal Kunal
Chen Jian-Jia
Li Jing
Ueter Niklas
von der Brüggen Georg
Publication venue
Publication date: 13/12/2017
Field of study

This paper considers the scheduling of parallel real-time tasks with arbitrary-deadlines. Each job of a parallel task is described as a directed acyclic graph (DAG). In contrast to prior work in this area, where decomposition-based scheduling algorithms are proposed based on the DAG-structure and inter-task interference is analyzed as self-suspending behavior, this paper generalizes the federated scheduling approach. We propose a reservation-based algorithm, called reservation-based federated scheduling, that dominates federated scheduling. We provide general constraints for the design of such systems and prove that reservation-based federated scheduling has a constant speedup factor with respect to any optimal DAG task scheduler. Furthermore, the presented algorithm can be used in conjunction with any scheduler and scheduling analysis suitable for ordinary arbitrary-deadline sporadic task sets, i.e., without parallelism

arXiv.org e-Print Archive

Crossref