
    Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres

    Cloud computing systems split compute- and data-intensive jobs into smaller tasks and execute them in parallel across clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby abnormally slow tasks within a job substantially delay its completion. Such stragglers are a direct threat to the fast execution of data-intensive jobs within cloud computing. Researchers have proposed an assortment of mechanisms, frameworks, and management techniques to detect and mitigate stragglers both proactively and reactively. In this paper, we present a comprehensive review of straggler management techniques within large-scale cloud data centres. We provide a detailed taxonomy of straggler causes, as well as of proposed management and mitigation techniques based on straggler characteristics and properties. From this systematic review, we outline several outstanding challenges and potential directions for future straggler research.
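
    As a concrete illustration of the reactive techniques this survey taxonomises, the sketch below flags stragglers by comparing each task's progress rate against the job's median rate, in the spirit of LATE-style speculation. The Task type and the 1.5x-median threshold are illustrative assumptions, not details drawn from any one surveyed system.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Task:
    task_id: str
    progress: float  # fraction of work completed, 0.0 .. 1.0
    elapsed: float   # seconds since the task started

def find_stragglers(tasks, slowdown_threshold=1.5):
    """Flag tasks whose progress rate falls far below the job's median rate."""
    rates = {t.task_id: t.progress / t.elapsed for t in tasks if t.elapsed > 0}
    med = median(rates.values())
    return [tid for tid, rate in rates.items() if rate < med / slowdown_threshold]

tasks = [Task("t0", 0.90, 10.0), Task("t1", 0.85, 10.0), Task("t2", 0.20, 10.0)]
print(find_stragglers(tasks))  # -> ['t2']: a candidate for speculative re-execution
```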

    PRISM: An Experiment Framework for Straggler Analytics in Containerized Clusters

    Containerized clusters of machines that provision Cloud services at scale are encountering substantive difficulties with stragglers, whereby a small subset of abnormally slow tasks degrades system performance. Stragglers remain an unsolved challenge due to their wide variety of root causes and their stochastic behavior. While there have been efforts to mitigate their effects, few works have attempted to empirically ascertain how system operational scenarios influence straggler occurrence and severity. This challenge is further compounded by the difficulty of conducting experiments within real-world containerized clusters: system maintenance and experiment design are often error-prone and time-consuming processes, and many of the tools created for workload submission and straggler injection are bespoke to specific clusters, limiting experiment reproducibility. In this paper we propose PRISM, a framework that automates containerized cluster setup, experiment design, and experiment execution. Our framework is capable of deployment, configuration, execution, and performance-trace transformation and aggregation for containerized application frameworks, enabling scripted execution of diverse workloads and cluster configurations. The framework reduces the time required for cluster setup and experiment execution from hours to minutes. We use PRISM to conduct automated experimentation over system operational conditions and identify that straggler manifestation is affected by resource contention, input data size, and scheduler architecture limitations.
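
    The sketch below gestures at the kind of scripted experiment sweep PRISM automates: a declarative specification of operational conditions expanded into individual runs. The spec fields and the run_experiment() helper are hypothetical stand-ins for illustration, not PRISM's actual API.

```python
import itertools
import json

spec = {
    "scheduler": ["centralized", "distributed"],
    "input_size_gb": [1, 10, 100],
    "contention_level": [0.0, 0.5, 0.9],  # fraction of co-located background load
}

def run_experiment(config):
    """Deploy a cluster configuration, submit the workload, and collect traces."""
    # A real framework would drive container deployment, workload submission,
    # and trace aggregation here; this stub just records the configuration.
    return {"config": config, "trace": "trace-placeholder"}

results = [run_experiment(dict(zip(spec, combo)))
           for combo in itertools.product(*spec.values())]
print(json.dumps(results[0], indent=2))  # 2 x 3 x 3 = 18 runs in total
```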

    Asynchronous federated and reinforcement learning for mobility-aware edge caching in IoVs

    Edge caching is a promising technology for reducing backhaul strain and content access delay in Internet-of-Vehicles (IoVs). It pre-caches frequently used contents close to vehicles through intermediate roadside units. Previous edge caching works often assume that content popularity is known in advance or obeys simplified models. However, such assumptions are unrealistic, as content popularity varies with uncertain spatial-temporal traffic demands in IoVs. Federated learning (FL) enables vehicles to predict popular content with distributed training. It keeps the training data local, thereby addressing privacy concerns and communication resource shortages. This paper investigates a mobility-aware edge caching strategy that exploits asynchronous FL and Deep Reinforcement Learning (DRL). We first implement a novel asynchronous FL framework for local updates and global aggregation of Stacked AutoEncoder (SAE) models. Then, utilizing the latent features extracted by the trained SAE model, we adopt a hybrid filtering model for predicting and recommending popular content. Furthermore, we explore intelligent caching decisions after content prediction. Based on the formulated Markov Decision Process (MDP) problem, we propose a DRL-based solution and adopt neural-network-based parameter approximation to cope with the curse of dimensionality in RL. Extensive simulations are conducted based on real-world trajectory data. In particular, our proposed method outperforms FedAvg, LRU, and NoDRL, improving the edge hit rate by roughly 6%, 21%, and 15%, respectively, when the cache capacity reaches 350 MB.
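
    One common way to realise asynchronous aggregation of the kind described here is to discount each client's update by its staleness, so models trained against old global weights perturb the current model less. The decay rule and mixing weight below are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def async_update(global_w, client_w, staleness, alpha=0.5):
    """Blend a (possibly stale) client model into the global model.

    The effective mixing weight shrinks as staleness grows, so updates
    computed against outdated global models have less influence.
    """
    mix = alpha / (1.0 + staleness)
    return (1.0 - mix) * global_w + mix * client_w

global_w = np.zeros(4)
for staleness, client_w in [(0, np.ones(4)), (3, 2.0 * np.ones(4))]:
    global_w = async_update(global_w, client_w, staleness)  # arrivals in any order
print(global_w)
```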

    A prescriptive analytics approach for energy efficiency in datacentres

    Given the evolution of Cloud Computing in recent years, the number of users and clients adopting it for both personal and business needs has grown at an unprecedented scale, naturally leading to increased deployment of Cloud datacentres across the globe. As a consequence, Cloud datacentres have become massive energy consumers and environmental polluters. Whilst the energy implications of Cloud datacentres are being addressed from various research perspectives, predicting future workload trends and behaviours at the datacentres, and thereby reducing the active server resources, is one dimension of green computing gaining the interest of researchers and Cloud providers. However, this entails various practical and analytical challenges imposed by the increased dynamism of Cloud systems. The behavioural characteristics of Cloud workloads and users are still not perfectly understood, which limits the reliability and prediction accuracy of existing research in this context. To this end, this thesis presents a comprehensive descriptive analysis of Cloud workload and user behaviours, uncovering their causes and the energy-related implications of Cloud Computing. The characteristics of Cloud workloads and users are presented empirically, including latency levels, job heterogeneity, user dynamicity, straggling task behaviours, the energy implications of stragglers, job execution and termination patterns, and the inherent periodicity among Cloud workload and user behaviours. Driven by this descriptive analytics, a novel user behaviour forecasting framework has been developed, aimed at a three-fold forecast of user behaviours: the session duration of users, the anticipated number of submissions, and the arrival trend of incoming workloads. Furthermore, a novel resource optimisation framework has been proposed to provision the optimal level of resources for executing jobs with reduced server energy expenditure and fewer job terminations. This optimisation framework encompasses a resource estimation module that predicts the anticipated resource consumption of arriving jobs and a classification module that classifies tasks based on their resource intensiveness. Both frameworks have been verified theoretically and tested experimentally on Google Cloud trace logs. Experimental analysis demonstrates their effectiveness in terms of the reliability of the forecast results and in reducing the server energy expended on executing jobs at the datacentres.
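
    As a toy illustration of the two optimisation modules the thesis describes, the sketch below pairs a simple resource estimator for arriving jobs with a classifier of task resource intensiveness. The moving-average estimator and the fixed thresholds are illustrative assumptions, not the thesis's actual models.

```python
def estimate_usage(history, alpha=0.3):
    """Exponentially weighted moving average over past resource usage."""
    est = history[0]
    for x in history[1:]:
        est = alpha * x + (1 - alpha) * est
    return est

def classify_task(cpu_req, mem_req, cpu_thresh=0.6, mem_thresh=0.6):
    """Bucket a task by which resources dominate its requirements."""
    if cpu_req >= cpu_thresh and mem_req >= mem_thresh:
        return "cpu-and-memory-intensive"
    if cpu_req >= cpu_thresh:
        return "cpu-intensive"
    if mem_req >= mem_thresh:
        return "memory-intensive"
    return "lightweight"

print(estimate_usage([0.2, 0.4, 0.5]))          # predicted next usage level
print(classify_task(cpu_req=0.7, mem_req=0.3))  # -> 'cpu-intensive'
```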

    Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

    Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge, and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications for parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on these approaches, we extrapolate potential directions for parallelism in deep learning.
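
    The most common concurrency pattern such surveys analyse is data parallelism: each worker computes gradients on its own data shard, and an all-reduce averages them before a single synchronized update. The toy linear model and the two in-process "workers" below are illustrative assumptions, not an example from the paper.

```python
import numpy as np

def worker_gradient(w, x_batch, y_batch):
    """Gradient of mean squared error for the linear model y = w * x."""
    return np.mean(2.0 * (x_batch * w - y_batch) * x_batch)

w, lr = 0.0, 0.1
shards = [(np.array([1.0, 2.0]), np.array([2.0, 4.0])),
          (np.array([3.0, 4.0]), np.array([6.0, 8.0]))]  # one shard per worker
for _ in range(50):
    grads = [worker_gradient(w, x, y) for x, y in shards]  # parallel in practice
    w -= lr * np.mean(grads)  # all-reduce: average gradients, apply one update
print(round(w, 3))  # converges toward w = 2.0
```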

    Towards Autonomous and Efficient Machine Learning Systems

    Computation-intensive machine learning (ML) applications are becoming some of the most popular workloads running atop cloud infrastructure. While training ML applications, practitioners face the challenge of tuning various system-level parameters, such as the number of training nodes, the communication topology during training, the instance type, and the number of serving nodes, to meet SLO requirements for bursty workloads during inference. Similarly, efficient resource utilization is another key challenge in cloud computing. This dissertation proposes high-performing and efficient ML systems that speed up training and inference while enabling automated and robust system management. To train an ML model in a distributed fashion, we focus on strategies that mitigate resource provisioning overhead and improve training speed without impacting model accuracy. More specifically, a system for autonomic and adaptive scheduling is built atop serverless computing that dynamically optimizes deployment and resource scaling of ML training tasks for cost-effectiveness and fast training. Similarly, a dynamic client selection framework is developed to address the straggler problem caused by resource heterogeneity, data quality, and data quantity in a privacy-preserving Federated Learning (FL) environment, again without impacting model accuracy. For serving bursty ML workloads, we focus on highly scalable and adaptive strategies that serve dynamically changing workloads cost-effectively and autonomically. We develop a framework that optimizes batching parameters on the fly using a lightweight profiler and an analytical model. We also devise strategies for serving ML workloads of varying sizes, which lead to non-deterministic service times, in a cost-effective manner. More specifically, we develop an SLO-aware framework that first analyzes request size and workload variation to estimate the number of serving functions, and then intelligently routes requests across multiple serving functions. Finally, the resource utilization of burstable instances is optimized to benefit both the cloud provider and the end user through a careful orchestration of resources (i.e., CPU, network, and I/O) using an analytical model and lightweight profiling, while complying with a user-defined SLO.
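
    As a hedged sketch of the batching idea, the snippet below picks the largest batch size whose estimated latency still meets the SLO, given a profiled cost model. The linear latency model and its coefficients are stand-ins for the dissertation's lightweight profiler and analytical model, not its actual implementation.

```python
def estimated_latency_ms(batch_size, base_ms=5.0, per_item_ms=1.2):
    """Profiled cost model: fixed overhead plus a per-request cost."""
    return base_ms + per_item_ms * batch_size

def pick_batch_size(slo_ms, max_batch=64):
    """Return the largest batch size whose estimated latency meets the SLO."""
    best = 1
    for b in range(1, max_batch + 1):
        if estimated_latency_ms(b) <= slo_ms:
            best = b
        else:
            break
    return best

print(pick_batch_size(slo_ms=30.0))  # -> 20: larger batches would violate the SLO
```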

    Multi-Tenant Geo-Distributed Data Analytics

    University of Minnesota Ph.D. dissertation, July 2019. Major: Computer Science. Advisors: Abhishek Chandra, Jon Weissman. Geo-distributed data analytics has gained much interest in recent years due to the need to extract insights from geo-distributed data. Traditionally, data analytics has been done within a cluster or data center environment. However, analyzing geo-distributed data using existing cluster-based systems typically cannot satisfy the timeliness requirements of most applications and results in wasteful resource consumption, due to the fundamental differences between the environments, especially the scarce, highly heterogeneous, and dynamic nature of wide-area resources: compute power and network bandwidth. This thesis addresses the challenges faced by geo-distributed data analytics systems in ensuring high-performance and reliable execution of multiple data analytics applications/queries. Specifically, the focus is on sharing resources across multiple users, applications, and computing frameworks. Sharing resources is attractive as it increases resource utilization and reduces operational cost. However, ensuring high-performance execution of multiple applications in a shared environment is challenging, as applications may compete for the same resources, especially in a wide-area environment where resources are scarce. Furthermore, dynamics such as workload variation, resource variation, stragglers, and failures are inevitable in large-scale distributed systems. These can cause large resource perturbations that significantly affect the performance of query execution. This thesis makes the following contributions. First, we present a resource sharing technique across multiple geo-distributed data analytics frameworks. The main challenge here is how to elastically partition resources while still allowing each individual framework high-locality scheduling, which is critical to the execution performance of geo-distributed analytics queries. We then address the problem of identifying and exploiting common executions across multiple queries to mitigate wasteful resource consumption, and demonstrate that traditional multi-query optimization may degrade overall query execution performance due to its lack of network awareness. Finally, we highlight the importance of adaptability in ensuring reliable query execution in the presence of dynamics, both for single and multiple query executions, and propose a systematic approach that selectively determines which queries to adapt and how to adapt them based on the types of queries, dynamics, and optimization goals.
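
    To make the network-awareness point concrete, the sketch below places each task at the compute site with the lowest estimated transfer time for its input, given inter-site bandwidth. The sites, bandwidth matrix, and greedy rule are assumptions for illustration, not the thesis's actual scheduler.

```python
INF = float("inf")
bandwidth_mbps = {  # link from data site -> compute site (local access is free)
    ("eu", "eu"): INF, ("eu", "us"): 100, ("eu", "asia"): 40,
    ("us", "eu"): 100, ("us", "us"): INF, ("us", "asia"): 60,
    ("asia", "eu"): 40, ("asia", "us"): 60, ("asia", "asia"): INF,
}

def place_task(data_site, input_mb, sites=("eu", "us", "asia")):
    """Pick the compute site with the lowest estimated input-transfer time."""
    def transfer_s(site):
        bw = bandwidth_mbps[(data_site, site)]
        return 0.0 if bw == INF else input_mb * 8 / bw  # MB -> megabits
    return min(sites, key=transfer_s)

print(place_task("asia", input_mb=500))  # -> 'asia': full locality, no transfer
```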