554 research outputs found

    Shutdown Policies with Power Capping for Large Scale Computing Systems

    Get PDF
    International audienceLarge scale distributed systems are expected to consume huge amounts of energy. To solve this issue, shutdown policies constitute an appealing approach able to dynamically adapt the resource set to the actual workload. However, multiple constraints have to be taken into account for such policies to be applied on real infrastructures, in particular the time and energy cost of shutting down and waking up nodes, and power capping to avoid disruption of the system. In this paper, we propose models translating these various constraints into different shutdown policies that can be combined. Our models are validated through simulations on real workload traces and power measurements on real testbeds.

    Predicting system-level power for a hybrid supercomputer

    Get PDF
    For current High Performance Computing systems to scale towards the holy grail of ExaFLOP performance, their power consumption has to be reduced by at least one order of magnitude. This goal can be achieved only through a combination of hardware and software advances. Being able to model and accurately predict the power consumption of large computational systems is necessary for software-level innovations such as proactive and power-aware scheduling, resource allocation and fault tolerance techniques. In this paper we present a 2-layer model of power consumption for a hybrid supercomputer (which held the top spot of the Green500 list on July 2013) that combines CPU, GPU and MIC technologies to achieve higher energy efficiency. Our model takes as input workload information - the number and location of resources that are used by each job at a certain time - and calculates the resulting system-level power consumption. When jobs are submitted to the system, the workload configuration can be foreseen based on the scheduler policies, and our model can then be applied to predict the ensuing system-level power consumption. Additionally, alternative workload configurations can be evaluated from a power perspective and more efficient ones can be selected. Applications of the model include not only power-aware scheduling but also prediction of anomalous behavior

    Interference of billing and scheduling strategies for energy and cost savings in modern data centers

    Get PDF
    The high energy consumption of HPC systems is an obstacle for evergrowing systems. Unfortunately, energy consumption does not decrease linearly with reduced workload; therefore, energy conservation techniques have been deployed on various levels which steer the overall system. While the overall saving of energy is useful, the price of energy is not necessarily proportional to the consumption. Particularly with renewable energies, there are occasions in which the price is significantly lower. The potential of saving energy costs when using smart contracts with energy providers is lacking research. In this paper, we conduct an analysis of the potential savings when applying cost-aware schedulers to data center workloads while considering power contracts that allow for dynamic (hourly) pricing. The contributions of this paper are twofold: 1) the theoretic assessment of cost savings; 2) the development of a simulator to replay batch scheduler traces which supports flexible energy cost models and various cost-aware scheduling algorithms. This allows to approximate the energy costs savings of data centers for various scenarios including off-peak and hourly budgeted energy prices as provided by the energy spot market. An evaluation is conducted with four annual job traces from the German Climate Computing Center (DKRZ) and Leibniz Supercomputing Centre (LRZ)

    Cloud management

    Get PDF

    Investigating Emerging Security Threats in Clouds and Data Centers

    Get PDF
    Data centers have been growing rapidly in recent years to meet the surging demand of cloud services. However, the expanding scale of a data center also brings new security threats. This dissertation studies emerging security issues in clouds and data centers from different aspects, including low-level cooling infrastructures and different virtualization techniques such as container and virtual machine (VM). We first unveil a new vulnerability called reduced cooling redundancy that might be exploited to launch thermal attacks, resulting in severely worsened thermal conditions in a data center. Such a vulnerability is caused by the wide adoption of aggressive cooling energy saving policies. We conduct thermal measurements and uncover effective thermal attack vectors at the server, rack, and data center levels. We also present damage assessments of thermal attacks. Our results demonstrate that thermal attacks can negatively impact the thermal conditions and reliability of victim servers, significantly raise the cooling cost, and even lead to cooling failures. Finally, we propose effective defenses to mitigate thermal attacks. We then perform a systematic study to understand the security implications of the information leakage in multi-tenancy container cloud services. Due to the incomplete implementation of system resource isolation mechanisms in the Linux kernel, a spectrum of system-wide host information is exposed to the containers, including host-system state information and individual process execution information. By exploiting such leaked host information, malicious adversaries can easily launch advanced attacks that can seriously affect the reliability of cloud services. Additionally, we discuss the root causes of the containers\u27 information leakage and propose a two-stage defense approach. The experimental results show that our defense is effective and incurs trivial performance overhead. Finally, we investigate security issues in the existing VM live migration approaches, especially the post-copy approach. While the entire live migration process relies upon reliable TCP connectivity for the transfer of the VM state, we demonstrate that the loss of TCP reliability leads to VM live migration failure. By intentionally aborting the TCP connection, attackers can cause unrecoverable memory inconsistency for post-copy, significantly increase service downtime, and degrade the running VM\u27s performance. From the offensive side, we present detailed techniques to reset the migration connection under heavy networking traffic. From the defensive side, we also propose effective protection to secure the live migration procedure

    Understanding Security Threats in Cloud

    Get PDF
    As cloud computing has become a trend in the computing world, understanding its security concerns becomes essential for improving service quality and expanding business scale. This dissertation studies the security issues in a public cloud from three aspects. First, we investigate a new threat called power attack in the cloud. Second, we perform a systematical measurement on the public cloud to understand how cloud vendors react to existing security threats. Finally, we propose a novel technique to perform data reduction on audit data to improve system capacity, and hence helping to enhance security in cloud. In the power attack, we exploit various attack vectors in platform as a service (PaaS), infrastructure as a service (IaaS), and software as a service (SaaS) cloud environments. to demonstrate the feasibility of launching a power attack, we conduct series of testbed based experiments and data-center-level simulations. Moreover, we give a detailed analysis on how different power management methods could affect a power attack and how to mitigate such an attack. Our experimental results and analysis show that power attacks will pose a serious threat to modern data centers and should be taken into account while deploying new high-density servers and power management techniques. In the measurement study, we mainly investigate how cloud vendors have reacted to the co-residence threat inside the cloud, in terms of Virtual Machine (VM) placement, network management, and Virtual Private Cloud (VPC). Specifically, through intensive measurement probing, we first profile the dynamic environment of cloud instances inside the cloud. Then using real experiments, we quantify the impacts of VM placement and network management upon co-residence, respectively. Moreover, we explore VPC, which is a defensive service of Amazon EC2 for security enhancement, from the routing perspective. Advanced Persistent Threat (APT) is a serious cyber-threat, cloud vendors are seeking solutions to ``connect the suspicious dots\u27\u27 across multiple activities. This requires ubiquitous system auditing for long period of time, which in turn causes overwhelmingly large amount of system audit logs. We propose a new approach that exploits the dependency among system events to reduce the number of log entries while still supporting high quality forensics analysis. In particular, we first propose an aggregation algorithm that preserves the event dependency in data reduction to ensure high quality of forensic analysis. Then we propose an aggressive reduction algorithm and exploit domain knowledge for further data reduction. We conduct a comprehensive evaluation on real world auditing systems using more than one-month log traces to validate the efficacy of our approach

    Adaptive Resource and Job Management for Limited Power Consumption

    Get PDF
    International audienceThe last decades have been characterized by anever growing requirement in terms of computing and storage resources.This tendency has recently put the pressure on the abilityto efficiently manage the power required to operate the hugeamount of electrical components associated with state-of-the-arthigh performance computing systems. The power consumption ofa supercomputer needs to be adjusted based on varying powerbudget or electricity availabilities. As a consequence, Resourceand Job Management Systems have to be adequately adaptedin order to efficiently schedule jobs with optimized performancewhile limiting power usage whenever needed.We introduce in this paper a new scheduling strategy thatcan adapt the executed workload to a limited power budget. Theoriginality of this approach relies upon a combination of speedscaling and node shutdown techniques for power reductions. It isimplemented into the widely used resource and job managementsystem SLURM. Finally, it is validated through large scale emulationsusing real production workload traces of the supercomputerCurie

    Energy Concerns with HPC Systems and Applications

    Full text link
    For various reasons including those related to climate changes, {\em energy} has become a critical concern in all relevant activities and technical designs. For the specific case of computer activities, the problem is exacerbated with the emergence and pervasiveness of the so called {\em intelligent devices}. From the application side, we point out the special topic of {\em Artificial Intelligence}, who clearly needs an efficient computing support in order to succeed in its purpose of being a {\em ubiquitous assistant}. There are mainly two contexts where {\em energy} is one of the top priority concerns: {\em embedded computing} and {\em supercomputing}. For the former, power consumption is critical because the amount of energy that is available for the devices is limited. For the latter, the heat dissipated is a serious source of failure and the financial cost related to energy is likely to be a significant part of the maintenance budget. On a single computer, the problem is commonly considered through the electrical power consumption. This paper, written in the form of a survey, we depict the landscape of energy concerns in computer activities, both from the hardware and the software standpoints.Comment: 20 page

    Towards Mitigating Co-incident Peak Power Consumption and Managing Energy Utilization in Heterogeneous Clusters

    Get PDF
    As data centers continue to grow in scale, the resource management software needs to work closely with the hardware infrastructure to provide high utilization, performance, fault tolerance, and high availability. Apache Mesos has emerged as a leader in this space, providing an abstraction over the entire cluster, data center, or cloud to present a uniform view of all the resources. In addition, frameworks built on Mesos such as Apache Aurora, developed within Twitter and later contributed to the Apache Software Foundation, allow massive job submissions with heterogeneous resource requirements. The availability of such tools in the Open Source space, with proven record of large-scale production use, make them suitable for research on how they can be adapted for use in campus-clusters and emerging cloud infrastructures for different workloads in both academia and industry. As data centers run these workloads and strive to maintain high utilization of their components, they suffer a significant cost in terms of energy and power consumption. To address this cost we have developed our own framework, Electron, for use with Mesos. Electron is designed to be configurable with heuristic-driven power capping policies along with different scheduling policies such as Bin Packing and First Fit. We characterize the performance of Electron, in comparison with the widely used Aurora framework. On average, our experiments show that Electron can reduce the 95th percentile of CPU and DRAM power usage by 27.89%, total energy consumption by 19.15%, average power consumption by 27.90%, and max peak power usage by 16.91%, while maintaining a similar makespan when compared to Aurora using the proper combination of power capping and scheduling policies
    • 

    corecore