14 research outputs found

    Uncertainty-Aware Workload Prediction in Cloud Computing

    Predicting future resource demand in Cloud Computing is essential for managing Cloud data centres and guaranteeing customers a minimum Quality of Service (QoS) level. Modelling the uncertainty of future demand improves the quality of the prediction and reduces the waste due to over-allocation. In this paper, we propose univariate and bivariate Bayesian deep learning models to predict the distribution of future resource demand and its uncertainty. We design different training scenarios for these models, where each procedure is a different combination of pretraining and fine-tuning steps on multiple dataset configurations. We also compare the bivariate model to its univariate counterpart, trained with one or more datasets, to investigate how the different components affect the accuracy of the prediction and impact the QoS. Finally, we investigate whether our models have transfer learning capabilities. Extensive experiments show that pretraining with multiple datasets boosts performance while fine-tuning does not. Our models generalise well on related but unseen time series, demonstrating transfer learning capabilities. Runtime performance analysis shows that the models are deployable in real-world applications. For this study, we preprocessed twelve datasets from real-world traces in a consistent and detailed way and made them available to facilitate research in this field.
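
    As a rough illustration of uncertainty-aware forecasting of this kind (not the paper's actual Bayesian architecture), the sketch below trains a small PyTorch network that outputs a predictive mean and log-variance for the next utilisation value and is optimised with a Gaussian negative log-likelihood; the window size, layer widths and random training data are all hypothetical.

```python
import torch
import torch.nn as nn

class GaussianWorkloadPredictor(nn.Module):
    """Predicts a mean and log-variance for the next workload value."""

    def __init__(self, window: int = 24, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(window, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.logvar_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.logvar_head(h)

def gaussian_nll(mean, logvar, target):
    # negative log-likelihood of target under N(mean, exp(logvar)), up to a constant
    return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).mean()

# hypothetical batch: 32 windows of 24 past utilisation values and the next value
x, y = torch.randn(32, 24), torch.randn(32, 1)
model = GaussianWorkloadPredictor()
mean, logvar = model(x)
loss = gaussian_nll(mean, logvar, y)
loss.backward()  # the variance head captures the predictive uncertainty
```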

    Predicting Scheduling Failures in the Cloud

    Cloud Computing has emerged as a key technology to deliver and manage computing, platform, and software services over the Internet. Task scheduling algorithms play an important role in the efficiency of cloud computing services as they aim to reduce the turnaround time of tasks and improve resource utilization. Several task scheduling algorithms have been proposed in the literature for cloud computing systems, the majority relying on the computational complexity of tasks and the distribution of resources. However, several tasks scheduled following these algorithms still fail because of unforeseen changes in the cloud environment. In this paper, using task execution and resource utilization data extracted from the execution traces of real-world applications at Google, we explore the possibility of predicting the scheduling outcome of a task using statistical models. If we can successfully predict task failures, we may be able to reduce the execution time of jobs by rescheduling failed tasks earlier (i.e., before their actual failing time). Our results show that statistical models can predict task failures with a precision of up to 97.4% and a recall of up to 96.2%. We simulate the potential benefits of such predictions using the toolkit GloudSim and find that they can improve the number of finished tasks by up to 40%. We also perform a case study using the Hadoop framework of Amazon Elastic MapReduce (EMR) and the jobs of a gene expression correlation analysis study from breast cancer research. We find that when extending the scheduler of Hadoop with our predictive models, the percentage of failed jobs can be reduced by up to 45%, with an overhead of less than 5 minutes.
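
    A hedged sketch of the general idea, not the paper's exact models: a scikit-learn random forest trained on per-task features to predict a failure label, evaluated with the same precision/recall metrics the abstract reports. The feature set and the synthetic labels below are hypothetical stand-ins for the Google trace attributes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# hypothetical per-task features: requested CPU, requested memory, priority,
# mean CPU usage, mean memory usage, number of resubmissions
rng = np.random.default_rng(42)
X = rng.random((5000, 6))
# synthetic failure label loosely tied to two of the features
y = (X[:, 0] + X[:, 3] + rng.normal(0, 0.2, 5000) > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"precision={precision_score(y_te, pred):.3f}  recall={recall_score(y_te, pred):.3f}")
```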

    Data Center Load Forecast Using Dependent Mixture Model

    The dependency on cloud computing is increasing day by day. With the boom of data centers, the cost is also increasing, which forces industries to come up with techniques and methodologies to reduce data center energy use. Load forecasting plays a vital role both in efficient scheduling and in operating a data center as a virtual power plant. In this thesis, a stochastic method based on dependent mixtures is developed to model the data center load and is used for day-ahead forecasting. The method is validated using three datasets from the National Renewable Energy Laboratory (NREL) and one other data center. The proposed method outperformed the classical autoregressive, moving-average, and neural-network-based forecasting methods, and resulted in a reduction of 7.91% in mean absolute percentage error (MAPE) for the forecast. A more accurate forecast can improve power scheduling and resource management, reducing the variable cost of power generation as well as the overall data center operating cost, which was quantified as a yearly saving of $13,705 for a typical 100 MW coal-fired tier-IV data center.
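
    A dependent mixture model of this kind can be approximated with a hidden Markov model. The sketch below, which assumes the third-party hmmlearn library and synthetic hourly load, fits a Gaussian HMM, rolls the hidden-state distribution forward through the transition matrix to produce a day-ahead forecast, and defines the MAPE metric reported in the thesis; it is an illustration of the approach, not the author's model.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# synthetic hourly load history with a daily cycle, shape (n_hours, 1)
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
load = (50 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)).reshape(-1, 1)

# a Gaussian HMM is one concrete form of a dependent mixture model
model = GaussianHMM(n_components=4, covariance_type="diag", n_iter=100, random_state=0)
model.fit(load)

# day-ahead forecast: take the state posterior at the last observed hour and
# propagate it through the transition matrix for the next 24 steps
_, posteriors = model.score_samples(load)
state_dist = posteriors[-1]
forecast = []
for _ in range(24):
    state_dist = state_dist @ model.transmat_
    forecast.append(float(state_dist @ model.means_[:, 0]))

def mape(actual, predicted):
    # mean absolute percentage error, used to score the day-ahead forecast
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return 100 * np.mean(np.abs((actual - predicted) / actual))
```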

    RVLBPNN: A Workload Forecasting Model for Smart Cloud Computing

    Given the increasing deployments of Cloud datacentres and the excessive usage of server resources, their associated energy and environmental implications are also increasing at an alarming rate. Cloud service providers are under immense pressure to significantly reduce both of these implications to promote green computing. Maintaining the desired level of Quality of Service (QoS) without violating the Service Level Agreement (SLA), whilst attempting to reduce the usage of datacentre resources, is an obvious challenge for Cloud service providers. Scaling the level of active server resources in accordance with the predicted incoming workloads is one possible way of reducing the undesirable energy consumption of the active resources without affecting performance quality. To this end, this paper analyzes the dynamic characteristics of Cloud workloads and defines a hierarchy for their latency sensitivity levels. Further, a novel workload prediction model for energy-efficient Cloud Computing is proposed, named RVLBPNN (Rand Variable Learning Rate Backpropagation Neural Network), based on the BPNN (Backpropagation Neural Network) algorithm. Experiments evaluating the prediction accuracy of the proposed model demonstrate that RVLBPNN achieves an improved prediction accuracy compared to the HMM and Naïve Bayes Classifier models by a considerable margin.
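
    The following NumPy sketch illustrates one plausible reading of a "random variable learning rate" BPNN: a one-hidden-layer backpropagation network whose step size is redrawn at random each epoch. It is a toy interpretation for illustration only, not the RVLBPNN algorithm as defined by the authors, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic workload windows (X) and next-step load (y)
X = rng.random((500, 12))
y = X.mean(axis=1, keepdims=True) + rng.normal(0, 0.05, (500, 1))

W1, b1 = rng.normal(0, 0.1, (12, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.1, (16, 1)), np.zeros(1)

for epoch in range(200):
    # redraw the learning rate each epoch instead of keeping it fixed
    lr = rng.uniform(0.01, 0.3)

    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = h @ W2 + b2

    # backpropagation for a mean squared error loss
    err = out - y
    grad_W2 = h.T @ err / len(X)
    grad_b2 = err.mean(axis=0)
    dh = (err @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ dh / len(X)
    grad_b1 = dh.mean(axis=0)

    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
```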

    SLA violation prediction: a machine learning perspective

    Cloud computing reduces the maintenance costs of services and allows users to access services on demand without being involved in technical implementation details. The relationship between a cloud provider and a customer is governed by a Service Level Agreement (SLA) that is established to define the level of the service and its associated costs. An SLA usually contains specific parameters and a minimum level of quality for each element of the service, negotiated between the cloud provider and the customer. However, one or more of the agreed terms in an SLA might be violated due to several issues such as occasional technical problems. Violations do happen in the real world: in terms of availability, Amazon Elastic Cloud faced an outage in 2011 when it crashed and many large customers such as Reddit and Quora were down for more than one day. As SLA violation prediction benefits both the user and the cloud provider, cloud researchers have in recent years started investigating models that are capable of predicting future violations. From a Machine Learning point of view, the problem of SLA violation prediction amounts to a binary classification problem. In this thesis, we explore two Machine Learning classification models, Naive Bayes and Random Forest, to predict future violations using features of a submitted task. Unlike previous works on SLA violation prediction or avoidance, our models are trained on a real-world dataset, which introduces new challenges. We validate our models using the Google Cloud Cluster trace as the dataset. Since SLA violations are rare events in the real world (about 2.2%), the classification task becomes more challenging because the classifier will always have a tendency to predict the dominant class. To overcome this issue, we use several re-sampling methods, such as Random Over-Sampling, Under-Sampling, SMOTE, NearMiss, One-Sided Selection and Neighborhood Cleaning Rule, as well as an ensemble of them, to re-balance the dataset.
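
    A minimal sketch of the re-balancing step, assuming the third-party imbalanced-learn package: SMOTE oversamples the roughly 2% minority (violation) class on the training split only, and a Random Forest is then fitted and evaluated. The features and labels are synthetic placeholders for the Google Cloud Cluster trace.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# synthetic task features; roughly 2.2% of tasks labelled as SLA violations
rng = np.random.default_rng(1)
X = rng.random((10000, 8))
y = (rng.random(10000) < 0.022).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# re-balance only the training split, then fit and evaluate the classifier
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```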

    Performance-Predictable Resource Management of Container-based Genetic Algorithm Workloads in Cloud Infrastructure

    Cloud computing, adopted by major providers like Amazon and Google, offers on-demand, pay-as-you-go services and resources through shared pools. Users submit workloads comprising multiple jobs, each containing tasks, including a specific genetic algorithm (GA) workload detailed in this thesis. This GA workload contains independent tasks from real-time multiprocessor allocation and Sudoku puzzle case studies, each with fixed deadlines and fitness requirements. Effective resource management is critical to enhance the Quality of Service (QoS) for cloud users. It involves resource allocation and adherence to QoS standards, guided by workload specifics. Container orchestration emerges as an essential deployment and management approach. This thesis focuses on managing multiple instances of genetic algorithms (GAs) in a cloud environment to achieve user-defined fitness levels within specified deadlines. It presents various approaches to allocate GAs to cloud nodes and control their execution iteratively. Initially, it introduces approaches such as fitness tracking (FT), fitness prediction (FP), fitness-prediction-based linear regression (FPLR), and fitness prediction based on weighted least squares (FPWLS) for managing the workload. To enhance resource efficiency, the thesis also addresses node interference, allowing multiple tasks to share resources while minimizing their impact on each other. It proposes a weighted node-interference approach that considers fitness levels and response times during iterations to optimize task allocation. The performance of these approaches was evaluated experimentally by testing two GA applications and comparing them against state-of-the-art container-based orchestration approaches. The approaches were compared in terms of the number of successful tasks, defined as the tasks that executed on time and achieved the required fitness, and were further compared through iteration analysis. Where performance prediction was used, prediction errors such as Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) were used to evaluate and compare the prediction approaches.
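
    As an illustration of the fitness-prediction-based linear regression (FPLR) idea only, and not the thesis's actual controller, the sketch below fits a linear trend to a GA task's observed fitness history and extrapolates it to the deadline to decide whether the task should be flagged for reallocation; a weighted least-squares variant (as in FPWLS) could pass weights to the same fit. All numbers are hypothetical.

```python
import numpy as np

def projected_fitness(iterations, fitness, iters_left, weights=None):
    """Fit fitness ~ a*iteration + b on the observed history (optionally
    weighted, as in a weighted-least-squares variant) and extrapolate to
    the last iteration that still fits within the deadline."""
    a, b = np.polyfit(iterations, fitness, deg=1, w=weights)
    return a * (iterations[-1] + iters_left) + b

# hypothetical history: fitness of one GA task observed so far
rng = np.random.default_rng(0)
iters = np.arange(1, 21)
fitness = 0.6 + 0.008 * iters + rng.normal(0, 0.005, iters.size)

required_fitness = 0.95
iters_before_deadline = 15   # iterations the node can still complete in time

projected = projected_fitness(iters, fitness, iters_before_deadline)
if projected < required_fitness:
    # a controller could migrate the task or grant it more resources;
    # here we only flag it
    print(f"projected fitness {projected:.3f} < required {required_fitness}")
```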

    GPGPU Reliability Analysis: From Applications to Large Scale Systems

    Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Since applications running on HPC systems are usually long-running, any error or failure could result in a significant loss of scientific productivity and system resources. Even worse, since HPC systems face severe resilience challenges as they progress towards exascale computing, it is imperative to develop a better understanding of the reliability of GPUs. This dissertation fills this gap by providing an understanding of the effects of soft errors on the entire system and on specific applications. To understand system-level reliability, a large-scale study of GPU soft errors in the field is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes so as to proactively and dynamically turn the costly error protection mechanisms on or off based on the prediction results. To understand the effects of soft errors at the application level, an effective fault-injection framework is designed to characterise the reliability and resilience of GPGPU applications. This framework is effective in reducing the tremendous number of fault-injection locations to a manageable size while still preserving remarkable accuracy. The framework is validated with both single-bit and multi-bit fault models for various GPGPU benchmarks. Lastly, taking advantage of the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at the kernel, CTA, and warp levels. In addition, given that some application outputs corrupted by soft errors may be acceptable, we present a use case showing how to enable low-overhead yet reliable GPU computing for GPGPU applications.
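
    A minimal sketch of the single-bit fault model mentioned above, not the dissertation's injection framework: a chosen bit of a float32 intermediate value is flipped and the corrupted result is compared against the fault-free ("golden") output to detect a silent data corruption.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 value (single-bit fault model)."""
    word = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))[0]

# inject a fault into one intermediate value of a toy reduction and compare
# the corrupted output against the fault-free ("golden") output
data = [0.5, 1.25, 2.0, 3.75]
golden = sum(data)

faulty = data[:]
faulty[2] = flip_bit(faulty[2], 30)   # flip a high-order exponent bit

corrupted = sum(faulty)
print(f"golden={golden}  corrupted={corrupted}  "
      f"silent data corruption: {'yes' if corrupted != golden else 'no'}")
```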

    A prescriptive analytics approach for energy efficiency in datacentres.

    Given the evolution of Cloud Computing in recent years, users and clients adopting Cloud Computing for both personal and business needs have increased at an unprecedented scale. This has naturally led to increased deployments and implementations of Cloud datacentres across the globe. As a consequence of this increasing adoption of Cloud Computing, Cloud datacentres have become massive energy consumers and environmental polluters. Whilst the energy implications of Cloud datacentres are being addressed from various research perspectives, predicting the future trend and behaviour of workloads at the datacentres, and thereby reducing the active server resources, is one particular dimension of green computing gaining the interest of researchers and Cloud providers. However, this involves various practical and analytical challenges imposed by the increased dynamism of Cloud systems. The behavioural characteristics of Cloud workloads and users are still not well understood, which restrains the reliability of the prediction accuracy of existing research works in this context. To this end, this thesis presents a comprehensive descriptive analytics of Cloud workload and user behaviours, uncovering the causes and energy-related implications of Cloud Computing. Furthermore, the characteristics of Cloud workloads and users, including latency levels, job heterogeneity, user dynamicity, straggling task behaviours, energy implications of stragglers, job execution and termination patterns, and the inherent periodicity among Cloud workload and user behaviours, have been empirically presented. Driven by the descriptive analytics, a novel user behaviour forecasting framework has been developed, aimed at a tri-fold forecast of user behaviours: the session duration of users, the anticipated number of submissions, and the arrival trend of the incoming workloads. Furthermore, a novel resource optimisation framework has been proposed to provision the optimum level of resources for executing jobs with reduced server energy expenditure and fewer job terminations. This optimisation framework encompasses a resource estimation module to predict the anticipated resource consumption level of arriving jobs and a classification module to classify tasks based on their resource intensiveness. Both of the proposed frameworks have been verified theoretically and tested experimentally on Google Cloud trace logs. Experimental analysis demonstrates the effectiveness of the proposed frameworks in terms of the reliability of the forecast results and the reduction in the server energy expenditure spent on executing jobs at the datacentres.
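
    A minimal sketch of the two modules described above, under assumed inputs: a linear regression estimates a job's resource consumption from its request, and k-means then groups jobs into three resource-intensiveness classes. The features and data are hypothetical, and this is an illustration rather than the thesis's framework.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# hypothetical per-job records: requested CPU, requested memory,
# observed mean CPU usage, observed mean memory usage
rng = np.random.default_rng(3)
jobs = rng.random((2000, 4))

# resource estimation: predict observed usage from the submitted request
estimator = LinearRegression().fit(jobs[:, :2], jobs[:, 2:])
predicted_usage = estimator.predict(jobs[:, :2])

# classification: group jobs into three resource-intensiveness classes
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(predicted_usage)
for cls in range(3):
    print(f"class {cls}: {np.mean(labels == cls):.1%} of jobs")
```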