A task execution scheme for dew computing with state-of-the-art smartphones
The computing resources of today's smartphones are underutilized most of the time. Using these resources could be highly beneficial in edge and fog computing contexts, for example to support urban services for citizens. However, new challenges arise, especially regarding job scheduling. Smartphones may form ad hoc networks, but individual devices differ greatly in computational capability and (tolerable) energy usage. We take these particularities into account to validate a task execution scheme that relies on the computing power that clusters of mobile devices can provide. In this paper, we expand the study of several practical job scheduling heuristics to include execution scenarios with state-of-the-art smartphones. The results of new simulated scenarios confirm previous findings and deepen our understanding of the baseline approaches already proposed for the problem. The study also sheds light on the capabilities of small clusters of mid-range and low-end smartphones when the objective is real-time stream processing with TensorFlow object recognition models as edge jobs. Ultimately, we aim at industry applications that improve task scheduling in dew computing contexts. Heuristics such as ours, together with supporting dew middleware, could broaden citizen participation by allowing much wider use of dew computing resources, especially in urban contexts, helping to build smart cities.
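The abstract above does not spell out its heuristics; the following sketch only illustrates the kind of battery- and capability-aware job assignment such a scheme implies. All names (Device, pick_device) and the scoring formula are assumptions for illustration, not the paper's actual method.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    flops_score: float   # relative compute capability (benchmark units)
    battery_pct: float   # remaining battery, 0-100
    queued_ops: float    # work already assigned (same units as job cost)

def pick_device(devices, job_cost):
    """Greedy heuristic: favor capable, well-charged, lightly loaded devices.

    The score is illustrative, not the paper's heuristic: estimated finish
    time, penalized when the device's battery is low.
    """
    def score(d):
        finish_time = (d.queued_ops + job_cost) / d.flops_score
        battery_penalty = 1.0 / max(d.battery_pct / 100.0, 0.05)
        return finish_time * battery_penalty
    best = min(devices, key=score)
    best.queued_ops += job_cost
    return best

cluster = [Device("mid-range", 2.0, 80, 0.0), Device("low-end", 1.0, 40, 0.0)]
print(pick_device(cluster, job_cost=4.0).name)  # -> mid-range
```

A real dew middleware would refresh battery and load readings from the devices rather than tracking them in the scheduler, but greedy assignment of this shape is a common baseline for the problem.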
Network Contention-Aware Cluster Scheduling with Reinforcement Learning
With continuous advances in deep learning, distributed training is becoming common in GPU clusters. For emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can significantly degrade training throughput. However, widely used scheduling policies often face limitations because they are agnostic to network contention between jobs. In this paper, we present a new approach to mitigating network contention in GPU clusters using reinforcement learning. We formulate GPU cluster scheduling as a reinforcement learning problem and learn a network contention-aware scheduling policy that efficiently captures contention sensitivities and dynamically adapts scheduling decisions through continuous evaluation and improvement. Compared to widely used scheduling policies, our approach reduces average job completion time by up to 18.2% and cuts tail job completion time by up to 20.7%, while allowing a preferable trade-off between average job completion time and resource utilization.
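The abstract leaves the MDP unspecified; as a toy sketch of casting placement as reinforcement learning, the following trains a softmax placement policy with REINFORCE, where per-node load stands in for contention state and the reward is the negative of a contention-penalized job completion time. The penalty model, reward, and policy form are all invented for illustration, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_NODES = 4

def contention_penalty(loads, node):
    # Assumed proxy: jobs already on a node contend for its network links.
    return 1.0 + 0.5 * loads[node]

def episode(theta, n_jobs=8):
    """Place n_jobs sequentially; return policy-gradient terms and reward."""
    loads = np.zeros(N_NODES)
    grads, total_jct = [], 0.0
    for _ in range(n_jobs):
        logits = theta - loads                         # state: per-node load
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = rng.choice(N_NODES, p=probs)
        grads.append(np.eye(N_NODES)[action] - probs)  # d log pi / d theta
        total_jct += contention_penalty(loads, action) # JCT proxy
        loads[action] += 1.0
    return grads, -total_jct                           # reward: negative JCT

theta, baseline = np.zeros(N_NODES), 0.0
for _ in range(200):                                   # REINFORCE + baseline
    grads, reward = episode(theta)
    baseline = 0.9 * baseline + 0.1 * reward
    theta += 0.05 * (reward - baseline) * np.sum(grads, axis=0)
print(theta)  # learned per-node placement preferences
```

The paper's actual policy would condition on far richer job and network features; the point here is only the loop structure: sample placements, observe contention-affected completion times, and update the policy toward lower-contention decisions.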
Heterogeneity-aware scheduling and data partitioning for system performance acceleration
Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with earlier homogeneous parallel machines, the hardware heterogeneity of modern systems provides new opportunities and challenges for performance acceleration. Classic operating-system optimisation problems, such as task scheduling, and application-specific optimisation techniques, such as adaptive data partitioning of parallel algorithms, must work together to address hardware heterogeneity.
Significant effort has been invested in this problem, but prior work either focuses on a specific type of heterogeneous system or algorithm, or offers a high-level framework without insight into how heterogeneity differs between types of system. A general software framework is required that can not only be adapted to multiple types of systems and workloads, but is also equipped with techniques to address a variety of hardware heterogeneity.
This thesis presents approaches to designing general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) in mobile devices, a hierarchical many-core supercomputer, and multi-FPGA systems for high performance computing (HPC) centers. Considering the sources of heterogeneity on AMPs, such as thread criticality, core sensitivity, and relative fairness, it proposes a collaborative approach that co-designs the task selector and core allocator of the OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations, and asymmetric physical connections, it proposes an application-specific automatic data-partitioning method for a modern supercomputer and a topological-ranking, heuristic-based scheduler for a multi-FPGA reconfigurable cluster.
Experiments on both a full-system simulator (gem5) and real systems (the Sunway TaihuLight supercomputer and Xilinx multi-FPGA clusters) demonstrate significant advantages of the proposed approaches over the state of the art on a variety of workloads.
Acknowledgement: "This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)."
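The thesis text above is a summary; as a minimal sketch of the adaptive data-partitioning idea it mentions, the following splits an algorithm's work across devices in proportion to measured throughput. The function name and example ratios are hypothetical.

```python
def partition(total_items, throughputs):
    """Split work proportionally to each device's measured throughput.

    A common adaptive-partitioning baseline; the thesis's actual method
    also accounts for the memory hierarchy and asymmetric links, which a
    purely throughput-proportional split ignores.
    """
    total_tp = sum(throughputs)
    shares = [int(total_items * tp / total_tp) for tp in throughputs]
    shares[-1] += total_items - sum(shares)  # rounding remainder to last
    return shares

# e.g. a CPU, a GPU, and an FPGA measured at a 1 : 6 : 3 throughput ratio
print(partition(1000, [1.0, 6.0, 3.0]))  # -> [100, 600, 300]
```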
Energy-Efficient GPU Clusters Scheduling for Deep Learning
Training deep neural networks (DNNs) is a major workload in datacenters today, resulting in tremendously fast growth of energy consumption. It is important to reduce energy consumption while completing deep learning (DL) training jobs early in data centers. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average Job Completion Time (JCT) under an energy budget. We first present performance models for DL training jobs that predict throughput and energy consumption under different configurations. Based on these performance models, PowerFlow dynamically allocates GPUs and adjusts the GPU-level or job-level configurations of DL training jobs. PowerFlow applies network packing and buddy allocation to job placement, thus avoiding extra energy consumed by cluster fragmentation. Evaluation results show that, at the same energy consumption, PowerFlow improves the average JCT by 1.57-3.39x compared to competitive baselines.
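The abstract describes performance models used to pick job configurations under an energy budget; a minimal sketch of that selection loop follows. The model forms (sublinear GPU scaling, energy as power times time) and all numbers are assumptions, not PowerFlow's fitted models.

```python
def predict(job_work, n_gpus, gpu_power_w, speed_factor):
    """Toy performance model (assumed form): throughput scales sublinearly
    with GPU count; energy is aggregate power times runtime."""
    throughput = speed_factor * n_gpus ** 0.9
    runtime_s = job_work / throughput
    energy_j = n_gpus * gpu_power_w * runtime_s
    return runtime_s, energy_j

def best_config(job_work, energy_budget_j, configs):
    """Pick the (n_gpus, power, speed) configuration with the lowest
    predicted JCT whose predicted energy stays within budget."""
    feasible = []
    for n_gpus, power_w, speed in configs:
        t, e = predict(job_work, n_gpus, power_w, speed)
        if e <= energy_budget_j:
            feasible.append((t, (n_gpus, power_w)))
    return min(feasible)[1] if feasible else None

# e.g. lower GPU power cap (200 W, slower) vs higher (300 W, faster)
configs = [(4, 200, 0.8), (4, 300, 1.0), (8, 200, 0.8), (8, 300, 1.0)]
print(best_config(job_work=1e4, energy_budget_j=4e6, configs=configs))
```

The real scheduler additionally reallocates GPUs across concurrent jobs and handles placement (network packing, buddy allocation), which this single-job sketch omits.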
GPU-enabled Function-as-a-Service for Machine Learning Inference
Function-as-a-Service (FaaS) is emerging as an important cloud computing service model, as it can improve the scalability and usability of a wide range of applications, especially machine learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks rely heavily on GPUs to achieve high performance; however, GPU support is currently lacking in existing FaaS solutions. The unique event-triggered and short-lived nature of functions poses new challenges to enabling GPUs on FaaS, which must consider the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that enables ML inference functions to efficiently utilize GPUs to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it caches ML models in GPU memory to improve the performance of model inference functions and manages GPU memory globally to improve cache utilization. Third, it co-designs GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory (for cache hits) and GPU cores (for parallel processing). A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and the proposed locality-aware scheduler achieves a 48x speedup over a default scheduler that performs only load balancing.
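As a sketch of what locality-aware scheduling with GPU-memory model caching can look like (an illustration of the idea, not the paper's implementation; the class and field names are invented):

```python
from collections import OrderedDict

class GpuWorker:
    def __init__(self, name, mem_mb):
        self.name, self.free_mb = name, mem_mb
        self.cache = OrderedDict()   # model name -> size (MB), in LRU order
        self.running = 0             # functions currently executing

    def has(self, model):
        return model in self.cache

def schedule(workers, model, model_mb):
    """Locality-aware placement: prefer a worker that already caches the
    model (a hit avoids transferring weights over PCIe); otherwise pick the
    least-loaded worker and evict LRU models until the weights fit."""
    hits = [w for w in workers if w.has(model)]
    if hits:
        w = min(hits, key=lambda w: w.running)   # balance load among hits
        w.cache.move_to_end(model)               # refresh LRU position
    else:
        w = min(workers, key=lambda w: w.running)
        while w.free_mb < model_mb and w.cache:  # evict least-recently used
            _, size = w.cache.popitem(last=False)
            w.free_mb += size
        w.cache[model] = model_mb
        w.free_mb -= model_mb
    w.running += 1
    return w

workers = [GpuWorker("gpu0", 8000), GpuWorker("gpu1", 8000)]
print(schedule(workers, "resnet50", 100).name)  # cold start -> gpu0
print(schedule(workers, "resnet50", 100).name)  # cache hit  -> gpu0
```

Note the tension the paper optimizes: a pure load balancer would send the second invocation to gpu1 and pay the weight transfer again, while this sketch trades some load imbalance for a cache hit.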
FedZero: Leveraging Renewable Excess Energy in Federated Learning
Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. Although scheduling workloads based on the availability of low-carbon energy has received considerable attention in recent years, it has not yet been investigated in the context of FL. FL is nevertheless a highly promising use case for carbon-aware computing, as its training jobs are energy-intensive batch processes scheduled in geo-distributed environments.
We propose FedZero, an FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure, effectively reducing training's operational carbon emissions to zero. Based on energy and load forecasts, FedZero exploits the spatio-temporal availability of excess energy by cherry-picking clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges considerably faster under these constraints than state-of-the-art approaches, is highly scalable, and is robust against forecasting errors.
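The abstract summarizes client selection from energy and load forecasts; a minimal sketch of such a filter follows. The field names, eligibility test, and fairness tiebreak are assumptions for illustration; FedZero's actual selection also optimizes for fast convergence.

```python
def select_clients(clients, round_deadline_s, k):
    """Pick up to k clients forecast to finish a round on excess energy
    and spare capacity alone.

    clients: list of dicts with (hypothetical) keys
      excess_w            - forecast renewable excess power (W)
      spare_flops         - spare compute capacity (FLOP/s)
      round_flops         - work per training round (FLOP)
      energy_per_round_j  - energy needed for one round (J)
      rounds_done         - rounds already contributed (for fairness)
    """
    eligible = []
    for c in clients:
        compute_ok = c["round_flops"] / c["spare_flops"] <= round_deadline_s
        energy_ok = c["excess_w"] * round_deadline_s >= c["energy_per_round_j"]
        if compute_ok and energy_ok:
            eligible.append(c)
    # fairness tiebreak: prefer clients that have participated least so far
    eligible.sort(key=lambda c: c["rounds_done"])
    return eligible[:k]
```

Because both forecasts vary over time and across sites, rerunning this filter each round is what yields the spatio-temporal cherry-picking the abstract describes.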
- …