890 research outputs found

    A Hybrid Optimization Algorithm for Efficient Virtual Machine Migration and Task Scheduling Using a Cloud-Based Adaptive Multi-Agent Deep Deterministic Policy Gradient Technique

    Get PDF
    This To achieve optimal system performance in the quickly developing field of cloud computing, efficient resource management—which includes accurate job scheduling and optimized Virtual Machine (VM) migration—is essential. The Adaptive Multi-Agent System with Deep Deterministic Policy Gradient (AMS-DDPG) Algorithm is used in this study to propose a cutting-edge hybrid optimization algorithm for effective virtual machine migration and task scheduling. An sophisticated combination of the War Strategy Optimization (WSO) and Rat Swarm Optimizer (RSO) algorithms, the Iterative Concept of War and Rat Swarm (ICWRS) algorithm is the foundation of this technique. Notably, ICWRS optimizes the system with an amazing 93% accuracy, especially for load balancing, job scheduling, and virtual machine migration. The VM migration and task scheduling flexibility and efficiency are greatly improved by the AMS-DDPG technology, which uses a powerful combination of deterministic policy gradient and deep reinforcement learning. By assuring the best possible resource allocation, the Adaptive Multi-Agent System method enhances decision-making even more. Performance in cloud-based virtualized systems is significantly enhanced by our hybrid method, which combines deep learning and multi-agent coordination. Extensive tests that include a detailed comparison with conventional techniques verify the effectiveness of the suggested strategy. As a consequence, our hybrid optimization approach is successful. The findings show significant improvements in system efficiency, shorter job completion times, and optimum resource utilization. Cloud-based systems have unrealized potential for synergistic optimization, as shown by the integration of ICWRS inside the AMS-DDPG framework. Enabling a high-performing and sustainable cloud computing infrastructure that can adapt to the changing needs of modern computing paradigms is made possible by this strategic resource allocation, which is attained via careful computational utilization

    Toward sustainable data centers: a comprehensive energy management strategy

    Get PDF
    Data centers are major contributors to the emission of carbon dioxide to the atmosphere, and this contribution is expected to increase in the following years. This has encouraged the development of techniques to reduce the energy consumption and the environmental footprint of data centers. Whereas some of these techniques have succeeded to reduce the energy consumption of the hardware equipment of data centers (including IT, cooling, and power supply systems), we claim that sustainable data centers will be only possible if the problem is faced by means of a holistic approach that includes not only the aforementioned techniques but also intelligent and unifying solutions that enable a synergistic and energy-aware management of data centers. In this paper, we propose a comprehensive strategy to reduce the carbon footprint of data centers that uses the energy as a driver of their management procedures. In addition, we present a holistic management architecture for sustainable data centers that implements the aforementioned strategy, and we propose design guidelines to accomplish each step of the proposed strategy, referring to related achievements and enumerating the main challenges that must be still solved.Peer ReviewedPostprint (author's final draft

    Workload Prediction for Efficient Performance Isolation and System Reliability

    Get PDF
    In large-scaled and distributed systems, like multi-tier storage systems and cloud data centers, resource sharing among workloads brings multiple benefits while introducing many performance challenges. The key to effective workload multiplexing is accurate workload prediction. This thesis focuses on how to capture the salient characteristics of the real-world workloads to develop workload prediction methods and to drive scheduling and resource allocation policies, in order to achieve efficient and in-time resource isolation among applications. For a multi-tier storage system, high-priority user work is often multiplexed with low-priority background work. This brings the challenge of how to strike a balance between maintaining the user performance and maximizing the amount of finished background work. In this thesis, we propose two resource isolation policies based on different workload prediction methods: one is a Markovian model-based and the other is a neural networks-based. These policies aim at, via workload prediction, discovering the opportune time to schedule background work with minimum impact on user performance. Trace-driven simulations verify the efficiency of the two pro- posed resource isolation policies. The Markovian model-based policy successfully schedules the background work at the appropriate periods with small impact on the user performance. The neural networks-based policy adaptively schedules user and background work, resulting in meeting both performance requirements consistently. This thesis also proposes an accurate while efficient neural networks-based pre- diction method for data center usage series, called PRACTISE. Different from the traditional neural networks for time series prediction, PRACTISE selects the most informative features from the past observations of the time series itself. Testing on a large set of usage series in production data centers illustrates the accuracy (e.g., prediction error) and efficiency (e.g., time cost) of PRACTISE. The superiority of the usage prediction also allows a proactive resource management in the highly virtualized cloud data centers. In this thesis, we analyze on the performance tickets in the cloud data centers, and propose an active sizing algorithm, named ATM, that predicts the usage workloads and re-allocates capacity to work- loads to avoid VM performance tickets. Moreover, driven by cheap prediction of usage tails, we also present TailGuard in this thesis, which dynamically clones VMs among co-located boxes, in order to efficiently reduce the performance violations of physical boxes in cloud data centers

    Dynamic Scaling for Service Oriented Applications: Implications of Virtual Machine Placement on IaaS Clouds

    Get PDF
    Abstraction of physical hardware using infrastructure-as-a-service (IaaS) clouds leads to the simplistic view that resources are homogeneous and that infinite scaling is possible with linear increases in performance. Support for autonomic scaling of multi-tier service oriented applications requires determination of when, what, and where to scale. \u27When\u27 is addressed by hotspot detection schemes using techniques including performance modeling and time series analysis. \u27What\u27 relates to determining the quantity and size of new resources to provision. \u27Where\u27 involves identification of the best location(s) to provision new resources. In this paper we investigate primarily \u27where\u27 new infrastructure should be provisioned, and secondly \u27what\u27 the infrastructure should be. Dynamic scaling of infrastructure for service oriented applications requires rapid response to changes in demand to meet application quality-of-service requirements. We investigate the performance and resource cost implications of VM placement when dynamically scaling server infrastructure of service oriented applications . We evaluate dynamic scaling in the context of providing modeling-as-a-service for two environmental science models

    On Improving The Performance And Resource Utilization of Consolidated Virtual Machines: Measurement, Modeling, Analysis, and Prediction

    Get PDF
    This dissertation addresses the performance related issues of consolidated \emph{Virtual Machines} (VMs). \emph{Virtualization} is an important technology for the \emph{Cloud} and data centers. Essential features of a data center like the fault tolerance, high-availability, and \emph{pay-as-you-go} model of services are implemented with the help of VMs. Cloud had become one of the significant innovations over the past decade. Research has been going on the deployment of newer and diverse set of applications like the \emph{High-Performance Computing} (HPC), and parallel applications on the Cloud. The primary method to increase the server resource utilization is VM consolidation, running as many VMs as possible on a server is the key to improving the resource utilization. On the other hand, consolidating too many VMs on a server can degrade the performance of all VMs. Therefore, it is necessary to measure, analyze and find ways to predict the performance variation of consolidated VMs. This dissertation investigates the causes of performance variation of consolidated VMs; the relationship between the resource contention and consolidation performance, and ways to predict the performance variation. Experiments have been conducted with real virtualized servers without using any simulation. All the results presented here are real system data. In this dissertation, a methodology is introduced to do the experiments with a large number of tasks and VMs; it is called the \emph{Incremental Consolidation Benchmarking Method} (ICBM). The experiments have been done with different types of resource-intensive tasks, parallel workflow, and VMs. Furthermore, to experiment with a large number of VMs and collect the data; a scheduling framework is also designed and implemented. Experimental results are presented to demonstrate the efficiency of the ICBM and framework

    Artificial intelligence driven anomaly detection for big data systems

    Get PDF
    The main goal of this thesis is to contribute to the research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially for Big Data platforms within cloud computing environments. The late detection and manual resolutions of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms to better analyze system performance and effectively utilize computing resources within cloud environments. Therefore, new precise and efficient performance management methods are the key to handling performance anomalies and interference impacts to improve the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of selecting the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on the RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning algorithms (ML), as well as against four different monitoring datasets. The results prove that our proposed method outperforms other ML methods, typically achieving 98–99% F-scores. Moreover, we prove that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model revolves around using artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters to efficiently train the anomaly detection model to achieve high accuracy. The objective is to accelerate the search process for finding the size of the training dataset, optimizing neural network configurations, and improving the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system is performed, demonstrating that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments up to 75% compared with naïve anomaly detection training. The last contribution overcomes the challenges of predicting completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution to estimate interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict the interference among batch jobs before it occurs within system. Our interference detection model can alleviate and estimate the task slowdown affected by the interference. This model assists the system operators in making an accurate decision to optimize job placement. Our model is agnostic to the business logic internal to each job. Instead, it is learned from system performance data by applying artificial neural networks to establish the completion time prediction of batch jobs within the cloud environments. We compare our model with three other baseline models (queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation based on 4500 experiments based on the DaCapo benchmarking suite was carried out, confirming the predictive efficiency and capabilities of the proposed model by achieving up to 10% MAPE compared with the other models.Open Acces

    Security supportive energy-aware scheduling and energy policies for cloud environments

    Get PDF
    Cloud computing (CC) systems are the most popular computational environments for providing elastic and scalable services on a massive scale. The nature of such systems often results in energy-related problems that have to be solved for sustainability, cost reduction, and environment protection. In this paper we defined and developed a set of performance and energy-aware strategies for resource allocation, task scheduling, and for the hibernation of virtual machines. The idea behind this model is to combine energy and performance-aware scheduling policies in order to hibernate those virtual machines that operate in idle state. The efficiency achieved by applying the proposed models has been tested using a realistic large-scale CC system simulator. Obtained results show that a balance between low energy consumption and short makespan can be achieved. Several security constraints may be considered in this model. Each security constraint is characterized by: (a) Security Demands (SD) of tasks; and (b) Trust Levels (TL) provided by virtual machines. SD and TL are computed during the scheduling process in order to provide proper security services. Experimental results show that the proposed solution reduces up to 45% of the energy consumption of the CC system. Such significant improvement was achieved by the combination of an energy-aware scheduler with energy-efficiency policies focused on the hibernation of VMs.COST Action IC140

    Energy and performance-aware scheduling and shut-down models for efficient cloud-computing data centers.

    Get PDF
    This Doctoral Dissertation, presented as a set of research contributions, focuses on resource efficiency in data centers. This topic has been faced mainly by the development of several energy-efficiency, resource managing and scheduling policies, as well as the simulation tools required to test them in realistic cloud computing environments. Several models have been implemented in order to minimize energy consumption in Cloud Computing environments. Among them: a) Fifteen probabilistic and deterministic energy-policies which shut-down idle machines; b) Five energy-aware scheduling algorithms, including several genetic algorithm models; c) A Stackelberg game-based strategy which models the concurrency between opposite requirements of Cloud-Computing systems in order to dynamically apply the most optimal scheduling algorithms and energy-efficiency policies depending on the environment; and d) A productive analysis on the resource efficiency of several realistic cloud–computing environments. A novel simulation tool called SCORE, able to simulate several data-center sizes, machine heterogeneity, security levels, workload composition and patterns, scheduling strategies and energy-efficiency strategies, was developed in order to test these strategies in large-scale cloud-computing clusters. As results, more than fifty Key Performance Indicators (KPI) show that more than 20% of energy consumption can be reduced in realistic high-utilization environments when proper policies are employed.Esta Tesis Doctoral, que se presenta como compendio de artículos de investigación, se centra en la eficiencia en la utilización de los recursos en centros de datos de internet. Este problema ha sido abordado esencialmente desarrollando diferentes estrategias de eficiencia energética, gestión y distribución de recursos, así como todas las herramientas de simulación y análisis necesarias para su validación en entornos realistas de Cloud Computing. Numerosas estrategias han sido desarrolladas para minimizar el consumo energético en entornos de Cloud Computing. Entre ellos: 1. Quince políticas de eficiencia energética, tanto probabilísticas como deterministas, que apagan máquinas en estado de espera siempre que sea posible; 2. Cinco algoritmos de distribución de tareas que tienen en cuenta el consumo energético, incluyendo varios modelos de algoritmos genéticos; 3. Una estrategia basada en la teoría de juegos de Stackelberg que modela la competición entre diferentes partes de los centros de datos que tienen objetivos encontrados. Este modelo aplica dinámicamente las estrategias de distribución de tareas y las políticas de eficiencia energética dependiendo de las características del entorno; y 4. Un análisis productivo sobre la eficiencia en la utilización de recursos en numerosos escenarios de Cloud Computing. Una nueva herramienta de simulación llamada SCORE se ha desarrollado para analizar las estrategias antes mencionadas en clústers de Cloud Computing de grandes dimensiones. Los resultados obtenidos muestran que se puede conseguir un ahorro de energía superior al 20% en entornos realistas de alta utilización si se emplean las estrategias de eficiencia energética adecuadas. SCORE es open source y puede simular diferentes centros de datos con, entre otros muchos, los siguientes parámetros: Tamaño del centro de datos; heterogeneidad de los servidores; tipo, composición y patrones de carga de trabajo, estrategias de distribución de tareas y políticas de eficiencia energética, así como tres gestores de recursos centralizados: Monolítico, Two-level y Shared-state. Como resultados, esta herramienta de simulación arroja más de 50 Key Performance Indicators (KPI) de rendimiento general, de distribucin de tareas y de energía.Premio Extraordinario de Doctorado U

    RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of Language Models

    Full text link
    Recent advancements in language models (LMs) have gained substantial attentions on their capability to generate human-like responses. Though exhibiting a promising future for various applications such as conversation AI, these LMs face deployment challenges on various devices due to their extreme computational cost and unpredictable inference latency. Such varied inference latency, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency and degrade the overall performance of LMs, especially under high-traffic workloads. Unfortunately, the bandwidth of these uncertainty sources is extensive, complicating the prediction of latency and the effects emanating from such uncertainties. To understand and mitigate the impact of uncertainty on real-time response-demanding systems, we take the first step to comprehend, quantify and optimize these uncertainty-induced latency performance variations in LMs. Specifically, we present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs. RT-LM innovatively quantifies how specific input uncertainties, adversely affect latency, often leading to an increased output length. Exploiting these insights, we devise a lightweight yet effective method to dynamically correlate input text uncertainties with output length at runtime. Utilizing this quantification as a latency heuristic, we integrate the uncertainty information into a system-level scheduler which explores several uncertainty-induced optimization opportunities, including uncertainty-aware prioritization, dynamic consolidation, and strategic CPU offloading. Quantitative experiments across five state-of-the-art LMs on two hardware platforms demonstrates that RT-LM can significantly reduce the average response time and improve throughput while incurring a rather small runtime overhead.Comment: Accepted by RTSS 202
    • …
    corecore