7,291 research outputs found
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly-efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. Finally, we hope that the proposed taxonomy and
mapping also helps to provide an easy way for new practitioners to understand
this complex area of research.Comment: 46 pages, 16 figures, Technical Repor
RAPID: Enabling Fast Online Policy Learning in Dynamic Public Cloud Environments
Resource sharing between multiple workloads has become a prominent practice
among cloud service providers, motivated by demand for improved resource
utilization and reduced cost of ownership. Effective resource sharing, however,
remains an open challenge due to the adverse effects that resource contention
can have on high-priority, user-facing workloads with strict Quality of Service
(QoS) requirements. Although recent approaches have demonstrated promising
results, those works remain largely impractical in public cloud environments
since workloads are not known in advance and may only run for a brief period,
thus prohibiting offline learning and significantly hindering online learning.
In this paper, we propose RAPID, a novel framework for fast, fully-online
resource allocation policy learning in highly dynamic operating environments.
RAPID leverages lightweight QoS predictions, enabled by
domain-knowledge-inspired techniques for sample efficiency and bias reduction,
to decouple control from conventional feedback sources and guide policy
learning at a rate orders of magnitude faster than prior work. Evaluation on a
real-world server platform with representative cloud workloads confirms that
RAPID can learn stable resource allocation policies in minutes, as compared
with hours in prior state-of-the-art, while improving QoS by 9.0x and
increasing best-effort workload performance by 19-43%
- …