354 research outputs found

    Adaptive Speculation for Efficient Internetware Application Execution in Clouds

    Modern Cloud computing systems are massive in scale, featuring environments that can execute highly dynamic Internetware applications with huge numbers of interacting tasks. This has led to a substantial challenge: the straggler problem, whereby a small subset of slow tasks significantly impedes parallel job completion. This problem results in longer service responses, degraded system performance, and late timing failures that can easily threaten Quality of Service (QoS) compliance. Speculative execution (or speculation) is the prominent method deployed in Clouds to tolerate stragglers by creating task replicas at runtime. The method detects stragglers by comparing the difference between each task's progress and the average task progress within a job against a predefined threshold. However, such a static threshold weakens speculation effectiveness, as it fails to capture the intrinsic diversity of timing constraints in Internetware applications, as well as dynamic environmental factors such as resource utilization. By considering such characteristics, different levels of strictness for replica creation can be imposed to adaptively achieve specified levels of QoS for different applications. In this paper we present an algorithm to improve the execution efficiency of Internetware applications by dynamically calculating the straggler threshold, considering key parameters including job QoS timing constraints, task execution progress, and optimal system resource utilization. We implement this dynamic straggler threshold in the YARN architecture to evaluate its effectiveness against existing state-of-the-art solutions. Results demonstrate that the proposed approach is capable of reducing parallel job response times by up to 20% compared to the static threshold, and of achieving a higher speculation success rate, up to 66.67% against 16.67% for the static method.
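    The static-threshold detection described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or YARN's implementation: the function name `find_stragglers`, the progress values, and the threshold of 0.2 are all hypothetical, chosen only to show how a fixed gap between a task's progress and the job average flags a straggler.

```python
from statistics import mean


def find_stragglers(progress, threshold=0.2):
    """Return ids of tasks whose progress lags the job average by more than `threshold`.

    `progress` maps a task id to its completed fraction in [0, 1].
    The default threshold is a hypothetical value, not one taken from the paper.
    """
    avg = mean(progress.values())
    # A task is flagged when it trails the average by more than the fixed gap.
    return sorted(task for task, p in progress.items() if avg - p > threshold)


# Hypothetical job with four tasks; t3 is far behind the others.
job_progress = {"t1": 0.9, "t2": 0.85, "t3": 0.4, "t4": 0.95}
print(find_stragglers(job_progress))
```

    Because `threshold` is a fixed argument here, every job is judged with the same strictness; the abstract's proposal is to compute that value dynamically from QoS timing constraints, task progress, and resource utilization instead.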

    SLA-Driven Cloud Computing Domain Representation and Management

    The assurance of Quality of Service (QoS) to applications, although identified as a key feature long ago [1], is one of the fundamental challenges that remain unsolved. In the Cloud Computing context, Quality of Service is defined as the measure of compliance with certain user requirements in the delivery of a cloud resource, such as CPU or memory load for a virtual machine, or more abstract and higher-level concepts such as response time or availability. Several research groups, both from academia and industry, have started working on describing the QoS levels that define the conditions under which the service needs to be delivered, as well as on developing the necessary means to effectively manage and evaluate the state of these conditions. [2] propose Service Level Agreements (SLAs) as the vehicle for the definition of QoS guarantees, and for the provision and management of resources. A Service Level Agreement (SLA) is a formal contract between providers and consumers, which defines the quality of service, the obligations and the guarantees in the delivery of a specific good. In the context of Cloud computing, SLAs are considered to be machine-readable documents, which are automatically managed by the provider's platform. SLAs need to be dynamically adapted to the variable conditions of resources and applications. In a multilayer architecture, different parts of an SLA may refer to different resources. SLAs may therefore express complex relationships between entities in a changing environment, and be applied to resource selection to implement intelligent scheduling algorithms. Therefore SLAs are widely regarded as a key feature for the future development of Cloud platforms. However, the application of SLAs for Grid and Cloud systems has many open research lines. One of these challenges, the modeling of the landscape, lies at the core of the objectives of the Ph.D. thesis.
    García García, A. (2014). SLA-Driven Cloud Computing Domain Representation and Management [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/36579

    Multi-Level ML Based Burst-Aware Autoscaling for SLO Assurance and Cost Efficiency

    Autoscaling is a technology that automatically scales the resources provided to applications without human intervention, guaranteeing runtime Quality of Service (QoS) while saving costs. However, user-facing cloud applications serve dynamic workloads that often exhibit variability and contain bursts, posing challenges to autoscaling for maintaining QoS within Service-Level Objectives (SLOs). Conservative strategies risk over-provisioning, while aggressive ones may cause SLO violations, making it more challenging to design effective autoscaling. This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services or applications under complex workloads, combining multi-level machine learning (ML) techniques to mitigate SLO violations while saving costs. BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes between predictable periodic workload spikes and actual bursts. When bursts are detected, BAScaler appropriately overestimates them and allocates resources accordingly to address the rapid growth in resource demand. On the other hand, BAScaler employs reinforcement learning to rectify potential inaccuracies in resource estimation, enabling more precise resource allocation during non-bursts. Experiments across ten real-world workloads demonstrate BAScaler's effectiveness, achieving a 57% average reduction in SLO violations and cutting resource costs by 10% compared to other prominent methods.

    DeepScaler: Holistic Autoscaling for Microservices Based on Spatiotemporal GNN with Adaptive Graph Learning

    Autoscaling functions provide the foundation for achieving elasticity in the modern cloud computing paradigm. They enable dynamic provisioning or de-provisioning of resources for cloud software services and applications without human intervention to adapt to workload fluctuations. However, autoscaling microservices is challenging due to various factors. In particular, complex, time-varying service dependencies are difficult to quantify accurately and can lead to cascading effects when allocating resources. This paper presents DeepScaler, a deep learning-based holistic autoscaling approach for microservices that focuses on coping with service dependencies to optimize service-level agreement (SLA) assurance and cost efficiency. DeepScaler employs (i) an expectation-maximization-based learning method to adaptively generate affinity matrices revealing service dependencies and (ii) an attention-based graph convolutional network to extract spatio-temporal features of microservices by aggregating neighbors' information in graph-structured data. Thus DeepScaler can capture more potential service dependencies and accurately estimate the resource requirements of all services under dynamic workloads. This allows DeepScaler to reconfigure the resources of the interacting services simultaneously in one resource provisioning operation, avoiding the cascading effect caused by service dependencies. Experimental results demonstrate that our method implements a more effective autoscaling mechanism for microservices that not only allocates resources accurately but also adapts to dependency changes, significantly reducing SLA violations by an average of 41% at lower costs. Comment: To be published in the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023).

    Dynamic Assembly for System Adaptability, Dependability, and Assurance

    (DASASA) Project. Author-contributed print item

    Data Replication and Its Alignment with Fault Management in the Cloud Environment

    Exponential data growth has become one of the major challenges worldwide. It may cause a series of negative impacts such as network overloading, high system complexity, and inadequate data security. Cloud computing was developed as a novel paradigm to alleviate massive data processing challenges with its on-demand services and distributed architecture. Data replication has been proposed to strategically distribute the data access load by creating multiple data copies at multiple cloud data centres. A replica-applied cloud environment not only achieves a decrease in response time, an increase in data availability, and a more balanced resource load, but also protects the cloud environment against upcoming faults. A reactive fault tolerance strategy is also required to handle faults that have already occurred. As a result, data replication strategies should be aligned with reactive fault tolerance strategies to achieve a complete management chain in the cloud environment. In this thesis, a data replication and fault management framework is proposed to establish decentralised overarching management of the cloud environment. Three data replication strategies are first proposed based on this framework. A replica creation strategy is proposed to reduce the total cost by jointly considering the data dependency and the access frequency in the replica creation decision-making process. In addition, a cloud map oriented and cost efficiency driven replica creation strategy is proposed to achieve the optimal cost reduction per replica in the cloud environment. The local data relationship and the remote data relationship are further analysed by creating two novel data dependency types, Within-DataCentre Data Dependency and Between-DataCentre Data Dependency, according to the data location. Furthermore, a network performance based replica selection strategy is proposed to avoid potential network overloading problems and to increase the number of concurrently running instances at the same time.

    How to Place Your Apps in the Fog -- State of the Art and Open Challenges

    Fog computing aims at extending the Cloud towards the IoT so as to achieve improved QoS and to empower latency-sensitive and bandwidth-hungry applications. The Fog calls for novel models and algorithms to distribute multi-service applications in such a way that data processing occurs wherever it is best placed, based on both functional and non-functional requirements. This survey reviews the existing methodologies for solving the application placement problem in the Fog, while pursuing three main objectives. First, it offers a comprehensive overview of the currently employed algorithms, of the availability of open-source prototypes, and of the size of test use cases. Second, it classifies the literature based on the application and Fog infrastructure characteristics that are captured by available models, with a focus on the considered constraints and the optimised metrics. Finally, it identifies some open challenges in application placement in the Fog.

    Control Strategies for Improving Cloud Service Robustness

    This thesis addresses challenges in increasing the robustness of cloud-deployed applications and services to unexpected events and dynamic workloads. Without precautions, hardware failures and unpredictable large traffic variations can quickly degrade the performance of an application due to a mismatch between provisioned resources and capacity needs. Similarly, disasters, such as power outages and fire, are unexpected events on a larger scale that threaten the integrity of the underlying infrastructure on which an application is deployed. First, the self-adaptive software concept of brownout is extended to replicated cloud applications. By monitoring the performance of each application replica, brownout is able to counteract temporary overload situations by reducing the computational complexity of jobs entering the system. To avoid existing load balancers interfering with the brownout functionality, brownout-aware load balancers are introduced. Simulation experiments show that the proposed load balancers outperform existing load balancers in providing a high quality of service to as many end users as possible. Experiments in a testbed environment further show how a replicated brownout-enabled application is able to maintain high performance during overloads as compared to its non-brownout equivalent. Next, a feedback controller for cloud autoscaling is introduced. Using a novel way of modeling the dynamics of a typical cloud application, a mechanism similar to the classical Smith predictor is presented to compensate for delays in reconfiguring resource provisioning. Simulation experiments show that the feedback controller is able to achieve faster control of the response times of a cloud application as compared to a threshold-based controller. Finally, a solution for handling the trade-off between performance and disaster tolerance for geo-replicated cloud applications is introduced. An automated mechanism for differentiating application traffic and replication traffic, and dynamically managing their bandwidth allocations using an MPC controller, is presented and evaluated in simulation. Comparisons with commonly used static approaches reveal that, in overload situations, the proposed solution provides increased flexibility in managing the trade-off between performance and data consistency.
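    The brownout idea in this abstract, monitoring a replica's performance and shedding optional computation under overload, can be sketched as a simple proportional update. This is an illustration under stated assumptions, not the thesis's controller: the name `update_dimmer`, the notion of a "dimmer" (the fraction of requests served with optional work enabled), the gain, and the latency target are all hypothetical.

```python
def update_dimmer(dimmer, measured_latency, target_latency, gain=0.5):
    """Return a new dimmer value in [0, 1] for one application replica.

    The dimmer is the assumed fraction of requests served with optional,
    computationally expensive content. It is lowered when the measured
    latency exceeds the target and raised when there is headroom.
    """
    # Relative error: negative when the replica is slower than the target.
    error = (target_latency - measured_latency) / target_latency
    # Proportional step, clamped to the valid range.
    return min(1.0, max(0.0, dimmer + gain * error))


# Hypothetical overload: latency is twice the target, so optional work is cut.
print(update_dimmer(1.0, measured_latency=2.0, target_latency=1.0))
```

    Each replica would run such an update on its own measurements, which is why the abstract notes that load balancers need to be brownout-aware: a balancer that reads low latency as spare capacity may misjudge a replica that is only fast because it has shed optional work.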

    Holistic Resource Management for Sustainable and Reliable Cloud Computing: An Innovative Solution to Global Challenge

    Minimizing the energy consumption of servers within cloud computing systems is of utmost importance to cloud providers for reducing operational costs and enhancing service sustainability by consolidating services onto fewer active servers. Moreover, providers must also provision high levels of availability and reliability; hence cloud services are frequently replicated across servers, which subsequently increases server energy consumption and resource overhead. These two objectives can conflict within cloud resource management decision making, which must balance service consolidation against replication to minimize energy consumption whilst maximizing server availability and reliability, respectively. In this paper, we propose a cuckoo optimization-based energy-reliability aware resource scheduling technique (CRUZE) for holistic management of cloud computing resources including servers, networks, storage, and cooling systems. CRUZE clusters and executes heterogeneous workloads on provisioned cloud resources, enhancing energy efficiency and reducing the carbon footprint in datacenters without adversely affecting cloud service reliability. We evaluate the effectiveness of CRUZE against existing state-of-the-art solutions using the CloudSim toolkit. Results indicate that our proposed technique is capable of reducing energy consumption by 20.1% whilst improving reliability and CPU utilization by 17.1% and 15.7% respectively, without affecting other Quality of Service parameters.