Stochastic scheduling and workload allocation : QoS support and profitable brokering in computing grids
Abstract: The Grid can be seen as a collection of services, each of which performs some functionality. Users of the Grid seek to use combinations of these services to perform the overall task they need to achieve. In general, this can be seen as a set of services with a workflow document describing how these services should be combined. The user may also have certain constraints on the workflow operations, such as execution time or cost to the user, specified in the form of a Quality of Service (QoS) document. The users submit their workflow to a brokering service along with the QoS document. The brokering service's task is to map any given workflow to a subset of the Grid services, taking the QoS and the state of the Grid (service availability and performance) into account. We propose an approach for generating constraint equations describing the workflow, the QoS requirements and the state of the Grid. This set of equations may be solved using Mixed-Integer Linear Programming (MILP), which is the traditional method. We further develop a novel 2-stage stochastic MILP which is capable of dealing with the volatile nature of the Grid and adapting the selection of the services during the lifetime of the workflow. We present experimental results comparing our approaches, showing that the 2-stage stochastic programming approach performs consistently better than other traditional approaches. Next we address workload allocation techniques for Grid workflows in a multi-cluster Grid. We model individual clusters as M/M/k queues and obtain a numerical solution for missed deadlines (failures) of tasks of Grid workflows. We also present an efficient algorithm for obtaining workload allocations of clusters. Next we model individual cluster resources as G/G/1 queues and solve an optimisation problem that minimises QoS requirement violation, provides a QoS guarantee and outperforms reservation-based scheduling algorithms.
Both approaches are evaluated through an experimental simulation, and the results confirm that the proposed workload allocation strategies, combined with traditional scheduling algorithms, perform considerably better in terms of satisfying QoS requirements of Grid workflows than scheduling algorithms that don't employ such workload allocation techniques. Next we develop a novel method for Grid brokers that aims at maximising profit whilst satisfying end-user needs with a sufficient guarantee in a volatile utility Grid. We develop a 2-stage stochastic MILP which is capable of dealing with the volatile nature of the Grid and obtaining cost bounds that ensure that end-user cost is minimised or satisfied and the broker's profit is maximised with sufficient guarantee. These bounds help brokers know beforehand whether the budget limits of end-users can be satisfied and, if not, obtain appropriate future leases from service providers. Experimental results confirm the efficacy of our approach.
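The abstract models each cluster as an M/M/k queue to estimate missed deadlines. As a minimal sketch of that kind of calculation, assuming FCFS service and treating a "miss" as a task whose queueing delay exceeds its remaining slack `t` (an assumption, not necessarily the thesis's definition), the standard Erlang C result gives P(W > t) = C(k, a)·exp(-(kμ - λ)t). The function names and the example cluster parameters are hypothetical.

```python
import math


def erlang_c(k: int, lam: float, mu: float) -> float:
    """Erlang C: probability that an arriving task has to wait in an M/M/k queue.

    k   -- number of identical servers in the cluster
    lam -- Poisson arrival rate of tasks
    mu  -- service rate of a single server (exponential service times)
    """
    a = lam / mu  # offered load in Erlangs
    rho = a / k   # utilisation of each server
    if rho >= 1:
        raise ValueError("unstable queue: need lam < k * mu")
    # Unnormalised weight of the states in which at least one server is idle
    idle_terms = sum(a**n / math.factorial(n) for n in range(k))
    # Unnormalised weight of all states in which every server is busy
    busy_term = a**k / math.factorial(k) / (1 - rho)
    return busy_term / (idle_terms + busy_term)


def prob_wait_exceeds(k: int, lam: float, mu: float, t: float) -> float:
    """P(W > t) for FCFS M/M/k: the chance a task queues longer than t."""
    return erlang_c(k, lam, mu) * math.exp(-(k * mu - lam) * t)


# Hypothetical cluster: 4 servers, each completing 1 task/s, 3 tasks/s arriving,
# and 2 s of slack before a task's deadline.
print(f"{prob_wait_exceeds(k=4, lam=3.0, mu=1.0, t=2.0):.4f}")
```

Adding servers (raising `k`) or diverting part of the arrival rate `lam` to another cluster lowers this probability; those are the levers a workload-allocation step could adjust.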
A Taxonomy for Management and Optimization of Multiple Resources in Edge Computing
Edge computing is promoted to meet increasing performance needs of
data-driven services using computational and storage resources close to the end
devices, at the edge of the current network. To achieve higher performance in
this new paradigm, one has to consider how to achieve efficient resource usage
across all three layers of the architecture: end devices, edge devices, and the
cloud. While cloud capacity is elastically extendable, end devices and edge
devices are to various degrees resource-constrained. Hence, efficient
resource management is essential to make edge computing a reality. In this
work, we first present terminology and architectures to characterize current
works within the field of edge computing. Then, we review a wide range of
recent articles and categorize relevant aspects in terms of four perspectives:
resource type, resource management objective, resource location, and resource
use. This taxonomy and the ensuing analysis are used to identify some gaps in
the existing research. Among several research gaps, we found that data,
storage, and energy are less studied as resources, and that the estimation,
discovery, and sharing objectives receive less attention. As for resource
types, the most well-studied resources are computation and communication
resources. Our analysis shows that resource management at the edge requires a
deeper understanding of how methods applied at different levels and geared
towards different resource types interact. Specifically, the impact of mobility
and collaboration schemes requiring incentives are expected to be different in
edge architectures compared to the classic cloud solutions. Finally, we find
that fewer works are dedicated to the study of non-functional properties or to
quantifying the footprint of resource management techniques, including
edge-specific means of migrating data and services.
Comment: Accepted in the Special Issue Mobile Edge Computing of the Wireless
Communications and Mobile Computing journal
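The four-perspective taxonomy can be read as a data structure: each surveyed article is tagged along each axis, and sparsely populated categories point to research gaps. The sketch below assumes that reading. The enum members come from the abstract, while the `Paper` record, the `gaps` helper, the threshold and the sample articles are hypothetical. To keep it short, only two of the four perspectives (resource type and location) are modelled.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ResourceType(Enum):
    COMPUTATION = "computation"
    COMMUNICATION = "communication"
    STORAGE = "storage"
    DATA = "data"
    ENERGY = "energy"


class Location(Enum):
    END_DEVICE = "end device"
    EDGE_DEVICE = "edge device"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Paper:
    """One surveyed article tagged along two of the taxonomy's perspectives."""
    title: str
    resource_types: frozenset[ResourceType]
    locations: frozenset[Location]


def gaps(papers, attr, universe, threshold=1):
    """Return the categories in `universe` tagged on fewer than `threshold` papers."""
    counts = Counter()
    for paper in papers:
        counts.update(getattr(paper, attr))
    return [category for category in universe if counts[category] < threshold]


# Hypothetical corpus of two tagged articles
corpus = [
    Paper("A", frozenset({ResourceType.COMPUTATION, ResourceType.COMMUNICATION}),
          frozenset({Location.EDGE_DEVICE})),
    Paper("B", frozenset({ResourceType.COMPUTATION}),
          frozenset({Location.EDGE_DEVICE, Location.CLOUD})),
]
print(gaps(corpus, "resource_types", list(ResourceType)))
print(gaps(corpus, "locations", list(Location)))
```

Raising `threshold` turns "never studied" into "rarely studied", which is closer to the survey's notion of a less prevalent research direction.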
Topology-aware GPU scheduling for learning workloads in cloud environments
Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments.
This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to approximately 1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.
This project is supported by the IBM/BSC Technology Center for Supercomputing
collaboration agreement. It has also received funding from the European Research Council (ERC) under the European Union’s Horizon
2020 research and innovation programme (grant agreement No 639595). It is
also partially supported by the Ministry of Economy of Spain under contract
TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051,
by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program
(SEV-2015-0493). We thank our IBM Research colleagues Alaa Youssef
and Asser Tantawi for the valuable discussions. We also thank SC17 committee
member Blair Bethwaite of Monash University for his constructive feedback on the earlier drafts of this paper.
Peer Reviewed. Postprint (published version)
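To make the idea of topology-aware placement concrete, here is a minimal sketch that chooses, among the free GPUs of one machine, the subset with the lowest total pairwise communication cost. The distance matrix, its values (1 for a direct GPU-to-GPU link, 3 for a cross-socket path) and the `place_job` helper are hypothetical. The exhaustive search is a stand-in for illustration, not the paper's placement algorithm, and it ignores the interference modelling the abstract describes.

```python
from itertools import combinations


def place_job(free_gpus, num_gpus, distance):
    """Pick `num_gpus` free GPUs minimising total pairwise communication cost.

    Exhaustive search over all subsets: fine for a single 4- or 8-GPU machine,
    but exponential in general. Returns (None, inf) if too few GPUs are free.
    """
    best, best_cost = None, float("inf")
    for combo in combinations(sorted(free_gpus), num_gpus):
        # Sum the topology distance over every pair of GPUs in the candidate set
        cost = sum(distance[a][b] for a, b in combinations(combo, 2))
        if cost < best_cost:
            best, best_cost = combo, cost
    return best, best_cost


# Hypothetical 4-GPU machine: GPUs 0-1 and 2-3 share fast links (cost 1);
# pairs that cross CPU sockets pay cost 3.
DISTANCE = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 1],
    [3, 3, 1, 0],
]

print(place_job({0, 1, 2, 3}, 2, DISTANCE))  # all free: picks (0, 1)
print(place_job({0, 2, 3}, 2, DISTANCE))     # GPU 1 busy: picks (2, 3)
```

A topology-unaware scheduler might hand out GPUs 0 and 2 simply because they are free. The cost function makes the penalty of that cross-socket pair explicit.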
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly efficient policies
automatically. Our system, Decima, uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load
- …
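Decima optimises a high-level objective such as average job completion time. The sketch below does not reproduce Decima. Under simplifying assumptions (one machine, all jobs available at time 0, no dependencies, hypothetical job durations), it only shows how much the scheduling order alone moves that objective. In this special case shortest-job-first is known to minimise the average; a learned policy is meant to find comparable workload-specific structure in far more complex settings.

```python
def average_completion_time(durations):
    """Average completion time when jobs run back-to-back in the given order.

    Assumes a single machine and that every job is available at time 0.
    """
    clock, total = 0.0, 0.0
    for d in durations:
        clock += d        # this job finishes once all earlier jobs and itself are done
        total += clock
    return total / len(durations)


jobs = [8.0, 1.0, 3.0]  # hypothetical job durations in seconds
fifo = average_completion_time(jobs)          # completions 8, 9, 12 -> 29/3 ≈ 9.67
sjf = average_completion_time(sorted(jobs))   # completions 1, 4, 12 -> 17/3 ≈ 5.67
print(f"FIFO: {fifo:.2f} s, SJF: {sjf:.2f} s")
```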