Online Job Scheduling in Distributed Machine Learning Clusters
Nowadays, large-scale distributed machine learning systems have been deployed
to support various analytics and intelligence services in IT firms. To train a
large dataset and derive the prediction/inference model, e.g., a deep neural
network, multiple workers are run in parallel to train partitions of the input
dataset, and update shared model parameters. In a shared cluster handling
multiple training jobs, a fundamental issue is how to efficiently schedule jobs
and set the number of concurrent workers to run for each job, such that server
resources are maximally utilized and model training can be completed in time.
Targeting a distributed machine learning system using the parameter server
framework, we design an online algorithm for scheduling the arriving jobs and
deciding the adjusted numbers of concurrent workers and parameter servers for
each job over its course, to maximize overall utility of all jobs, contingent
on their completion times. Our online algorithm design utilizes a primal-dual
framework coupled with efficient dual subroutines, achieving good long-term
performance guarantees with polynomial time complexity. Practical effectiveness
of the online algorithm is evaluated using trace-driven simulation and testbed
experiments, which show that it outperforms scheduling algorithms commonly
adopted in today's cloud systems.
The Online Knapsack Problem with Departures
The online knapsack problem is a classic online resource allocation problem
in networking and operations research. Its basic version studies how to pack
online arriving items of different sizes and values into a capacity-limited
knapsack. In this paper, we study a general version that includes item
departures, while also considering multiple knapsacks and multi-dimensional
item sizes. We design a threshold-based online algorithm and prove that the
algorithm can achieve order-optimal competitive ratios. Beyond worst-case
performance guarantees, we also aim to achieve near-optimal average performance
under typical instances. Towards this goal, we propose a data-driven online
algorithm that learns within a policy class that guarantees a worst-case
performance bound. In trace-driven experiments, we show that our data-driven
algorithm outperforms other benchmark algorithms in an application of online
knapsack to job scheduling for cloud computing.
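As a rough illustration of what a threshold-based rule looks like, the sketch below implements the classic single-knapsack threshold policy without departures. It is not the algorithm from the paper above. It assumes every item's value density (value per unit size) lies in a known range `[low, high]`; the function names and the parameters `low` and `high` are hypothetical.

```python
import math


def threshold(utilization, low, high):
    """Minimum value density accepted at a given fill fraction in [0, 1].

    Assumption: the classic exponential threshold for densities known to lie
    in [low, high]. It starts at low / e and rises to high when full.
    """
    return (high * math.e / low) ** utilization * (low / math.e)


def online_knapsack(items, capacity, low, high):
    """Process (size, value) items in arrival order; decisions are irrevocable."""
    used = 0.0
    accepted = []
    for index, (size, value) in enumerate(items):
        density = value / size
        fits = used + size <= capacity
        # Accept only if the item fits and its density clears the threshold
        # at the current utilization.
        if fits and density >= threshold(used / capacity, low, high):
            accepted.append(index)
            used += size
    return accepted, used


if __name__ == "__main__":
    stream = [(1, 1.0), (1, 1.0), (1, 10.0)]
    print(online_knapsack(stream, capacity=2, low=1.0, high=10.0))
```

The threshold rises with utilization, so low-density items are admitted only while the knapsack is mostly empty and the remaining capacity is reserved for denser items that may arrive later.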
Truth and Regret in Online Scheduling
We consider a scheduling problem where a cloud service provider has multiple
units of a resource available over time. Selfish clients submit jobs, each with
an arrival time, deadline, length, and value. The service provider's goal is to
implement a truthful online mechanism for scheduling jobs so as to maximize the
social welfare of the schedule. Recent work shows that under a stochastic
assumption on job arrivals, there is a single-parameter family of mechanisms
that achieves near-optimal social welfare. We show that given any such family
of near-optimal online mechanisms, there exists an online mechanism that in the
worst case performs nearly as well as the best of the given mechanisms. Our
mechanism is truthful whenever the mechanisms in the given family are truthful
and prompt, and achieves optimal (within constant factors) regret.
We model the problem of competing against a family of online scheduling
mechanisms as one of learning from expert advice. A primary challenge is that
any scheduling decisions we make affect not only the payoff at the current
step, but also the resource availability and payoffs in future steps.
Furthermore, switching from one algorithm (a.k.a. expert) to another in an
online fashion is challenging both because it requires synchronization with the
state of the latter algorithm and because it affects the incentive
structure of the algorithms. We further show how to adapt our algorithm to a
non-clairvoyant setting where job lengths are unknown until jobs are run to
completion. Once again, in this setting, we obtain truthfulness along with
asymptotically optimal regret (within poly-logarithmic factors).
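For readers unfamiliar with learning from expert advice, the sketch below shows the standard Hedge (multiplicative weights) update in its simplest, stateless form. It is only an illustration under strong assumptions: payoffs lie in [0, 1] and each round is independent, so it deliberately ignores the state coupling and incentive issues that the abstract identifies as the main challenges. The function name and the `learning_rate` parameter are hypothetical.

```python
import math


def hedge(payoffs, learning_rate):
    """Run Hedge over a payoff matrix.

    payoffs[t][i] is the payoff in [0, 1] of expert i in round t.
    Returns (expected payoff of Hedge, regret against the best single expert).
    """
    num_experts = len(payoffs[0])
    weights = [1.0] * num_experts
    expected_total = 0.0
    for round_payoffs in payoffs:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected payoff when following expert i with probability probs[i].
        expected_total += sum(p * g for p, g in zip(probs, round_payoffs))
        # Exponentially up-weight experts that did well this round.
        weights = [
            w * math.exp(learning_rate * g)
            for w, g in zip(weights, round_payoffs)
        ]
    best = max(sum(r[i] for r in payoffs) for i in range(num_experts))
    return expected_total, best - expected_total


if __name__ == "__main__":
    rounds = [[1.0, 0.0]] * 20  # expert 0 is always right
    print(hedge(rounds, learning_rate=0.5))
```

With payoffs in [0, 1], the regret of this update is at most ln(N)/η + ηT/8 for N experts, T rounds, and learning rate η, which is the kind of guarantee the scheduling setting has to recover despite stateful experts.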
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load.
Truthful Online Scheduling with Commitments
We study online mechanisms for preemptive scheduling with deadlines, with the
goal of maximizing the total value of completed jobs. This problem is
fundamental to deadline-aware cloud scheduling, but there are strong lower
bounds even for the algorithmic problem without incentive constraints. However,
these lower bounds can be circumvented under the natural assumption of deadline
slackness, i.e., that there is a guaranteed lower bound $s$ on the ratio
between a job's size and the time window in which it can be executed.
In this paper, we construct a truthful scheduling mechanism with a constant
competitive ratio, given slackness $s$. Furthermore, we show that if $s$ is
large enough then we can construct a mechanism that also satisfies a commitment
property: it can be determined whether or not a job will finish, and the
requisite payment if so, well in advance of each job's deadline. This is
notable because, in practice, users with strict deadlines may find it
unacceptable to discover only very close to their deadline that their job has
been rejected.
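The deadline slackness assumption can be made concrete with a small check. The sketch below reads the assumption as: each job's execution window (deadline minus arrival) is at least a factor `s` times its size. The `Job` fields, the function names, and the symbol `s` are hypothetical and chosen only for illustration; this is not the mechanism from the paper.

```python
from dataclasses import dataclass


@dataclass
class Job:
    arrival: float   # earliest time the job can start
    deadline: float  # latest time the job may finish
    size: float      # processing time required


def slackness(job):
    """Ratio of the job's available time window to its size."""
    return (job.deadline - job.arrival) / job.size


def satisfies_slackness(jobs, s):
    """True if every job's window is at least s times its size."""
    return all(slackness(job) >= s for job in jobs)


if __name__ == "__main__":
    jobs = [Job(arrival=0, deadline=10, size=2), Job(arrival=5, deadline=11, size=3)]
    print([slackness(job) for job in jobs])
    print(satisfies_slackness(jobs, s=2.0))
```

A larger guaranteed ratio gives the scheduler more room to delay or preempt a job and still finish it, which is the intuition behind why slack helps circumvent the lower bounds.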
Reservation-Based Federated Scheduling for Parallel Real-Time Tasks
This paper considers the scheduling of parallel real-time tasks with
arbitrary deadlines. Each job of a parallel task is described as a directed
acyclic graph (DAG). In contrast to prior work in this area, where
decomposition-based scheduling algorithms are proposed based on the
DAG-structure and inter-task interference is analyzed as self-suspending
behavior, this paper generalizes the federated scheduling approach. We propose
a reservation-based algorithm, called reservation-based federated scheduling,
that dominates federated scheduling. We provide general constraints for the
design of such systems and prove that reservation-based federated scheduling
has a constant speedup factor with respect to any optimal DAG task scheduler.
Furthermore, the presented algorithm can be used in conjunction with any
scheduler and scheduling analysis suitable for ordinary arbitrary-deadline
sporadic task sets, i.e., without parallelism.