
    QoS-aware Storage Virtualization: A Framework for Multi-tier Infrastructures in Cloud Storage Systems

    The emergence of the relatively modern phenomenon of cloud computing has brought a different approach to the availability and storage of software and data on a remote online server ‘in the cloud’, which can be accessed by pre-determined users through the Internet and, in certain scenarios, allows data to be shared. Data availability, reliability, and access performance are three important factors that cloud providers must take into consideration when designing a high-performance storage system for any organization. Due to the high costs of maintaining and managing multiple local storage systems, it is now considered more practical to design a virtualized multi-tier storage infrastructure, yet the required Quality of Service (QoS) must still be guaranteed at the application level within the cloud without ongoing human intervention. Such automation is necessary because the delivered QoS can vary widely both across and within storage tiers, depending on the access profile of the data. This survey paper presents a general framework for the optimal design of a distributed system aimed at efficient data availability and reliability. To this end, numerous state-of-the-art technologies and methods are reviewed, especially for multi-tiered distributed cloud systems. Moreover, several critical aspects that must be considered to obtain optimal performance from QoS-aware cloud systems are discussed, highlighting solutions for handling failure situations and the possible advantages and benefits of QoS. Finally, this paper examines the improvements made to QoS-aware cloud systems such as Q-cloud since 2010, including any further efforts to make Q-cloud more adaptable and secure.

    Toward sustainable data centers: a comprehensive energy management strategy

    Data centers are major contributors to the emission of carbon dioxide into the atmosphere, and this contribution is expected to increase in the coming years. This has encouraged the development of techniques to reduce the energy consumption and the environmental footprint of data centers. While some of these techniques have succeeded in reducing the energy consumption of the hardware equipment of data centers (including IT, cooling, and power supply systems), we claim that sustainable data centers will only be possible if the problem is tackled through a holistic approach that includes not only the aforementioned techniques but also intelligent and unifying solutions that enable a synergistic and energy-aware management of data centers. In this paper, we propose a comprehensive strategy to reduce the carbon footprint of data centers that uses energy as the driver of their management procedures. In addition, we present a holistic management architecture for sustainable data centers that implements the aforementioned strategy, and we propose design guidelines to accomplish each step of the proposed strategy, referring to related achievements and enumerating the main challenges that must still be solved.
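    To make the energy-as-a-driver idea concrete, the sketch below estimates the power a consolidation pass could save by packing load onto fewer servers under a simple linear power model. The power figures, the utilization cap, and the consolidation step itself are illustrative assumptions on my part, not the paper's strategy or architecture.

```python
# Toy estimate (my assumption, not the paper's method) of the power saved by
# consolidating load onto fewer servers under a linear power model:
# P(server) = p_idle + (p_peak - p_idle) * utilization.
import math

def consolidation_savings(utilizations, p_idle=100.0, p_peak=300.0, cap=0.8):
    """utilizations: per-active-server CPU utilization in [0, 1]."""
    total_load = sum(utilizations)                        # aggregate CPU demand
    servers_needed = max(1, math.ceil(total_load / cap))  # keep each server under `cap`
    power_before = sum(p_idle + (p_peak - p_idle) * u for u in utilizations)
    power_after = servers_needed * p_idle + (p_peak - p_idle) * total_load
    return len(utilizations) - servers_needed, power_before - power_after

# Example: ten servers at 20% load each can be packed onto three servers,
# saving exactly the idle power of the seven machines that can be switched off.
print(consolidation_savings([0.2] * 10))   # -> (7, 700.0)
```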

    Adaptive runtime techniques for power and resource management on multi-core systems

    Energy-related costs are among the major contributors to the total cost of ownership of data centers and high-performance computing (HPC) clusters. As a result, future data centers must be energy-efficient to meet the continuously increasing computational demand. Constraining the power consumption of the servers is a widely used approach for managing energy costs and complying with power delivery limitations. In tandem, virtualization has become a common practice, as virtualization reduces hardware and power requirements by enabling consolidation of multiple applications onto a smaller set of physical resources. However, administration and management of data center resources have become more complex due to the growing number of virtualized servers installed in data centers. Therefore, designing autonomous and adaptive energy-efficiency approaches is crucial to achieving sustainable and cost-efficient operation in data centers. Many modern data centers running enterprise workloads successfully implement energy-efficiency approaches today. However, the nature of multi-threaded applications, which are becoming more common in all computing domains, brings additional design and management challenges. Tackling these challenges requires a deeper understanding of the interactions between the applications and the underlying hardware nodes. Although cluster-level management techniques bring significant benefits, node-level techniques provide more visibility into application characteristics, which can then be used to further improve the overall energy efficiency of the data centers. This thesis proposes adaptive runtime power and resource management techniques for multi-core systems. It demonstrates that taking multi-threaded workload characteristics into account during management significantly improves the energy efficiency of the server nodes, which are the basic building blocks of data centers. The key distinguishing features of this work are as follows: we implement the proposed runtime techniques on state-of-the-art commodity multi-core servers and show that their energy efficiency can be significantly improved by (1) taking multi-threaded application-specific characteristics into account while making resource allocation decisions, (2) accurately tracking dynamically changing power constraints by using low-overhead application-aware runtime techniques, and (3) coordinating dynamic adaptive decisions at various layers of the computing stack, specifically at the system and application levels. Our results show that efficient resource distribution under power constraints yields energy savings of up to 24% compared to existing approaches, along with the ability to meet power constraints 98% of the time for a diverse set of multi-threaded applications.
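    As a rough illustration of the node-level approach described above, the following sketch shows one control iteration of an application-aware power-capping loop that grows or shrinks a multi-threaded application's core count to track a power budget. The callbacks and the scaling-efficiency heuristic are hypothetical placeholders, not the thesis's actual runtime.

```python
# A minimal sketch of one iteration of an application-aware power-capping loop.
# `read_power_watts` and `set_active_cores` are hypothetical callbacks that a
# real runtime would back with RAPL readings and thread pinning; the
# scaling-efficiency test stands in for the multi-threaded application
# characteristics the thesis uses to guide allocation decisions.

def capping_step(read_power_watts, set_active_cores, budget_watts,
                 active_cores, max_cores, scaling_efficiency=0.8):
    """Shrink or grow the application's core count to track the power budget."""
    power = read_power_watts()
    if power > budget_watts:                               # over budget: shed a core
        active_cores = max(1, active_cores - 1)
    elif power < 0.9 * budget_watts and active_cores < max_cores:
        if scaling_efficiency > 0.5:                       # only grow if the app scales
            active_cores += 1
    set_active_cores(active_cores)
    return active_cores

# Example wiring with dummy callbacks: 135 W measured against a 120 W budget
# causes one core to be dropped.
cores = capping_step(lambda: 135.0, lambda n: None,
                     budget_watts=120.0, active_cores=8, max_cores=16)
```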

    Optimizing egalitarian performance in the side-effects model of colocation for data center resource management

    In data centers, up to dozens of tasks are colocated on a single physical machine. Machines are used more efficiently, but tasks' performance deteriorates, as colocated tasks compete for shared resources. As tasks are heterogeneous, the resulting performance dependencies are complex. In our previous work [18] we proposed a new combinatorial optimization model that uses two parameters of a task - its size and its type - to characterize how a task influences the performance of other tasks allocated to the same machine. In this paper, we study the egalitarian optimization goal: maximizing the worst-off performance. This problem generalizes the classic makespan minimization on multiple processors (P||Cmax). We prove that variants that are polynomially solvable in classic multiprocessor scheduling become NP-hard and hard to approximate when the number of types is not constant. For a constant number of types, we propose a PTAS, a fast approximation algorithm, and a series of heuristics. We simulate the algorithms on instances derived from a trace of one of Google's clusters. Algorithms aware of jobs' types lead to better performance compared with algorithms solving P||Cmax. The notion of type enables us to model the degradation of performance caused by colocation using standard combinatorial optimization methods. Types add a layer of additional complexity. However, our results - approximation algorithms and good average-case performance - show that types can be handled efficiently. Comment: Author's version of a paper published in the Euro-Par 2017 proceedings; extends the published paper with additional results and proofs.
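    The toy heuristic below illustrates the size/type idea: a task's performance is assumed to degrade with the sizes of its colocated tasks, weighted by a per-type-pair influence coefficient, and tasks are placed greedily (LPT-style) so as to keep the worst-off performance high. The influence matrix and the 1/(1 + load) performance formula are illustrative assumptions, not the paper's model, PTAS, or approximation algorithm.

```python
# Toy type-aware greedy placement in the spirit of the size/type model.
# influence[t1][t2] says how much a unit of colocated size of type t2 hurts a
# task of type t1; the performance formula is an illustrative assumption.

def greedy_egalitarian(tasks, n_machines, influence):
    """tasks: list of (size, type). Returns (machines as index lists, worst-off perf)."""
    machines = [[] for _ in range(n_machines)]

    def perf(i, group):
        _, type_i = tasks[i]
        load = sum(influence[type_i][tasks[j][1]] * tasks[j][0]
                   for j in group if j != i)
        return 1.0 / (1.0 + load)                 # more colocated load -> worse perf

    def group_min(group):
        return min(perf(i, group) for i in group)

    # LPT-style: place large tasks first, each on the machine that keeps the
    # post-placement worst-off performance as high as possible.
    for i in sorted(range(len(tasks)), key=lambda i: -tasks[i][0]):
        best = max(range(n_machines), key=lambda m: group_min(machines[m] + [i]))
        machines[best].append(i)

    return machines, min(group_min(m) for m in machines if m)

# Example: two types, where type-0 tasks suffer more from colocated type-1 tasks.
machines, worst = greedy_egalitarian(
    tasks=[(4, 0), (3, 0), (2, 1), (1, 1)], n_machines=2,
    influence=[[0.1, 0.5], [0.3, 0.1]])
```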

    DQN-based intelligent controller for multiple edge domains

    Advanced technologies like network function virtualization (NFV) and multi-access edge computing (MEC) have been used to build flexible, highly programmable, and autonomously manageable infrastructures close to the end-users, at the edge of the network. In this vein, the use of single-board computers (SBCs) in commodity clusters has gained attention for deploying virtual network functions (VNFs) due to their low cost, low energy consumption, and easy programmability. This paper deals with the problem of deploying VNFs in a multi-cluster system formed by this kind of node, which is characterized by limited computational and battery capacities. Additionally, existing platforms for orchestrating and managing VNFs do not consider energy levels during their placement decisions, and therefore they are not optimized for energy-constrained environments. In this regard, this study proposes an intelligent controller as a global allocation mechanism based on deep reinforcement learning (DRL), specifically on deep Q-networks (DQN). The conceived mechanism optimizes energy consumption in SBCs by selecting the most suitable nodes across several clusters to deploy event requests in terms of nodes’ resources and events’ demands. A comparison with available allocation algorithms revealed that our solution required 28% fewer resource costs and reduced energy consumption in the clusters’ computing nodes by 35% while maintaining a high acceptance ratio. This work has been supported in part (50%) by the Agencia Estatal de Investigación of Ministerio de Ciencia e Innovación of Spain under projects PID2019-108713RB-C51 & PID2019-108713RB-C52 MCIN/AEI/10.13039/501100011033; and in part (50%) by AI@EDGE H2020-ICT-52-2020 under grant agreement No. 10101592
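    The sketch below frames the placement decision as a DQN problem in the way the abstract suggests: the state concatenates per-node free CPU and battery levels with the incoming request's demand, the action picks the hosting node, and the update is a standard temporal-difference step against a target network. The network sizes, state encoding, and reward shape are my assumptions, not the paper's design.

```python
# Minimal DQN-style placement sketch (PyTorch). Environment, reward shaping,
# and replay-buffer handling are omitted; everything below is an illustrative
# assumption, not the paper's controller.
import random
import torch
import torch.nn as nn

N_NODES = 4
STATE_DIM = 2 * N_NODES + 1          # (free cpu, battery) per node + request demand

class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_NODES))
    def forward(self, x):
        return self.net(x)

def select_node(qnet, state, epsilon=0.1):
    """Epsilon-greedy choice of the node that will host the request."""
    if random.random() < epsilon:
        return random.randrange(N_NODES)
    with torch.no_grad():
        return int(qnet(state).argmax())

def train_step(qnet, target, optimizer, batch, gamma=0.99):
    """One DQN update: TD target computed with the slowly-updated target network."""
    states, actions, rewards, next_states = batch
    q = qnet(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q, rewards + gamma * q_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example state: free CPU and battery fraction for each of the 4 nodes,
# followed by the request's CPU demand.
example_state = torch.tensor([0.6, 0.9, 0.2, 0.4, 0.8, 0.7, 0.5, 0.3, 0.25])
node = select_node(QNet(), example_state)
```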

    Cloud Workload Allocation Approaches for Quality of Service Guarantee and Cybersecurity Risk Management

    It has become a dominant trend in industry to adopt cloud computing -- thanks to its unique advantages in flexibility, scalability, elasticity, and cost efficiency -- for providing online cloud services over the Internet using large-scale data centers. In the meantime, the relentless increase in demand for affordable and high-quality cloud-based services, for individuals and businesses, has led to tremendously high power consumption and operating expense and thus has posed pressing challenges on cloud service providers in finding efficient resource allocation policies. Allowing several services or Virtual Machines (VMs) to commonly share the cloud's infrastructure enables cloud providers to optimize resource usage, power consumption, and operating expense. However, server sharing among users and VMs causes performance degradation and results in cybersecurity risks. Consequently, how to develop efficient and effective resource management policies that make the appropriate decisions to optimize the trade-offs among resource usage, service quality, and cybersecurity loss plays a vital role in the sustainable future of cloud computing. In this dissertation, we focus on cloud workload allocation problems for resource optimization subject to Quality of Service (QoS) guarantees and cybersecurity risk constraints. To facilitate our research, we first develop a cloud computing prototype that we utilize to empirically validate the performance of different proposed cloud resource management schemes under a close-to-practical, but also isolated and well-controlled, environment. We then focus our research on resource management policies for real-time cloud services with QoS guarantees. Based on a queuing model with reneging, we establish and formally prove a series of fundamental principles relating service timing characteristics to their resource demands, based on which we develop several novel resource management algorithms that statically guarantee the QoS requirements for cloud users. We then study the problem of mitigating cybersecurity risk and loss in cloud data centers via cloud resource management. We employ game theory to model the VM-to-VM interdependent cybersecurity risks in cloud clusters. We then conduct a thorough analysis based on our game-theory-based model and develop several algorithms for cybersecurity risk management. Specifically, we start our cybersecurity research from a simple case with only two types of VMs and then extend it to a more general case with an arbitrary number of VM types. Our extensive numerical and experimental results show that our proposed algorithms can significantly outperform the existing methodologies for large-scale cloud data centers in terms of resource usage, cybersecurity loss, and computational effectiveness.
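    As a small illustration of how a queuing model with reneging can back a QoS check, the sketch below estimates, by simulation, the probability that a request abandons an M/M/1 queue because its wait would exceed a patience bound, and declares an allocation feasible if that probability stays under a target. This is a generic illustration, not the dissertation's queuing analysis or its formally proven principles.

```python
# Monte Carlo estimate of the reneging probability in a single-server queue
# with Poisson arrivals and exponential service, where a request gives up if
# its projected wait exceeds a patience bound. Illustrative only.
import random

def reneging_probability(arrival_rate, service_rate, patience, n=100_000, seed=0):
    rng = random.Random(seed)
    t, server_free_at, reneged = 0.0, 0.0, 0
    for _ in range(n):
        t += rng.expovariate(arrival_rate)             # next arrival time
        wait = max(0.0, server_free_at - t)
        if wait > patience:                            # request reneges (gives up)
            reneged += 1
            continue
        service = rng.expovariate(service_rate)
        server_free_at = max(server_free_at, t) + service
    return reneged / n

def meets_qos(arrival_rate, service_rate, patience, target=0.01):
    """Feasibility check: keep the fraction of reneging requests under `target`."""
    return reneging_probability(arrival_rate, service_rate, patience) <= target

# e.g. meets_qos(arrival_rate=8.0, service_rate=10.0, patience=0.5, target=0.05)
```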

    Effective Cache Apportioning for Performance Isolation Under Compiler Guidance

    With a growing number of cores in modern high-performance servers, effective sharing of the last-level cache (LLC) is more critical than ever. The primary agenda of such systems is to maximize performance by efficiently supporting multi-tenancy of diverse workloads. However, this can be particularly challenging to achieve in practice, because modern workloads exhibit dynamic phase behaviour, which causes their cache requirements and sensitivities to vary at finer granularities during execution. Unfortunately, existing systems are oblivious to application phase behavior and are unable to detect and react quickly enough to these rapidly changing cache requirements, often incurring significant performance degradation. In this paper, we propose Com-CAS, a new apportioning system that provides dynamic cache allocations for co-executing applications. Com-CAS differs from existing cache partitioning systems by adapting to the dynamic cache requirements of applications just in time, as opposed to reacting after the fact, and it does so without any hardware modifications. The front-end of Com-CAS consists of compiler analysis equipped with machine learning mechanisms to predict cache requirements, while the back-end consists of a proactive scheduler that dynamically apportions the LLC amongst co-executing applications leveraging Intel Cache Allocation Technology (CAT). Com-CAS's partitioning scheme utilizes the compiler-generated information at finer granularities to predict rapidly changing dynamic application behaviors, while simultaneously maintaining data locality. Our experiments show that Com-CAS improves average weighted throughput by 15% over an unpartitioned cache system and outperforms the state-of-the-art partitioning system KPart by 20%, while keeping the worst individual application's completion-time degradation within various Service-Level Agreement (SLA) requirements.
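    The fragment below sketches only the apportioning step: given per-application predictions of how many LLC ways a phase needs, it scales the demands to fit the cache and hands out contiguous, non-overlapping way masks formatted as a Linux resctrl-style schemata line. The prediction side (compiler analysis plus machine learning) and the actual write to the resctrl filesystem are out of scope here, and the helper names and scaling policy are mine, not Com-CAS's.

```python
# Toy apportioning step: turn predicted per-application LLC way demands into
# contiguous, non-overlapping CAT way masks, formatted in the hex style used
# by a resctrl schemata line ("L3:<cache_id>=<mask>"). Illustrative assumptions
# throughout; not Com-CAS's scheduler.

def apportion_ways(demands, total_ways):
    """demands: {app: predicted ways}. Scale to fit, then hand out contiguous masks."""
    scale = min(1.0, total_ways / max(1, sum(demands.values())))
    masks, next_way = {}, 0
    for app, want in demands.items():
        ways = min(max(1, int(want * scale)), total_ways - next_way)
        masks[app] = ((1 << ways) - 1) << next_way   # contiguous bits, as CAT requires
        next_way += ways                             # a zero mask means the toy ran out of ways
    return masks

def schemata_line(mask, cache_id=0):
    return f"L3:{cache_id}={mask:x}"

# Example: an 11-way LLC shared by three co-running applications.
masks = apportion_ways({"db": 6, "batch": 2, "web": 4}, total_ways=11)
lines = {app: schemata_line(m) for app, m in masks.items()}
```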

    Resource management for extreme scale high performance computing systems in the presence of failures

    High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computations of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of being forced to share resources; in particular, contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through (a) the optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects of system failures as well as application co-location on large-scale HPC systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems.
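    For checkpoint interval selection specifically, a common reference point is the Young/Daly first-order approximation, which balances checkpoint overhead against the expected rework after a failure; the sketch below applies it after scaling a per-node MTBF down to the full system. This is the standard textbook formula, shown only to illustrate the sensitivity to system parameters that the dissertation studies, not its own model.

```python
# Young/Daly first-order optimal checkpoint interval, with the system MTBF
# derived from independent node failures (MTBF shrinks as 1/N). Standard
# formula; not the dissertation's method.
import math

def young_daly_interval(checkpoint_cost_s, system_mtbf_s):
    """Compute time between checkpoints that minimizes expected overhead (1st order)."""
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)

def node_to_system_mtbf(node_mtbf_s, n_nodes):
    """With independent node failures, the system MTBF is the node MTBF / N."""
    return node_mtbf_s / n_nodes

# Example: 100,000 nodes with a 5-year node MTBF gives a system MTBF of about
# 26 minutes; with a 60 s checkpoint cost the formula suggests checkpointing
# roughly every 7 minutes.
mtbf = node_to_system_mtbf(5 * 365 * 24 * 3600, 100_000)
interval = young_daly_interval(60.0, mtbf)
```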

    Monitoring and analysis system for performance troubleshooting in data centers

    It was not long ago, on Christmas Eve 2012, that a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which was not realized at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. For example, Netflix, which was using hundreds of Amazon ELB services, experienced an extensive streaming service outage, with many customers unable to watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world’s attention. As this Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application- to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with levels of perturbation over 400% lower than those seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
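    As a stand-in for the kind of anomaly-detection function an operator might deploy through VScope (the abstract does not specify its algorithms), the sketch below is a simple online z-score detector over a sliding window of a per-VM metric such as request latency; it is illustrative, not VScope's actual detector.

```python
# Simple online z-score anomaly detector over a sliding window of a metric
# stream. Illustrative stand-in, not VScope's algorithms.
from collections import deque
import statistics

class ZScoreDetector:
    def __init__(self, window=120, threshold=3.0):
        self.samples = deque(maxlen=window)   # most recent `window` observations
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` is anomalous relative to the recent window."""
        anomalous = False
        if len(self.samples) >= 30:                        # wait for some history
            mean = statistics.fmean(self.samples)
            std = statistics.pstdev(self.samples) or 1e-9  # guard against zero variance
            anomalous = abs(value - mean) / std > self.threshold
        self.samples.append(value)
        return anomalous

# Example: a steady latency stream followed by a spike triggers a single alert.
det = ZScoreDetector(window=120, threshold=3.0)
stream = [10.0, 10.5, 9.5] * 40 + [55.0]
alerts = [t for t, latency in enumerate(stream) if det.observe(latency)]  # -> [120]
```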