9 research outputs found

    Exploiting Field Data Analysis to Improve the Reliability and Energy-efficiency of HPC Systems

    No full text
    As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become more challenging in future Exascale systems. The efficient design and operation of such large-scale installations rely critically on developing an in-depth understanding of their failure behaviour as well as their energy consumption profiles. Among the main obstacles facing the study of HPC reliability and energy efficiency issues, however, is the difficulty of replicating HPC problems inside a lab environment or obtaining access to operational field data from HPC organizations. Examples of such field data include node failure logs, hardware replacement logs, system event logs, workload traces, data from environmental sensors, and more. Fortunately, the past decade has witnessed an increasing number of HPC organizations willing to share their operational data with researchers or even make them publicly available. In this work, we exploit field data analysis to improve our understanding of HPC failures in real-world systems and to optimize HPC fault-tolerance protocols while analyzing their respective performance and energy overheads. Throughout our analyses, we investigate various HPC design tradeoffs between system performance, system reliability, and energy efficiency. Our results in the first part of this thesis provide critical insights into how and why failures happen in HPC installations as well as which types of failures are correlated in the field. We study the impact of various factors on system reliability, including environmental factors such as data center temperature and power quality. We find that the effect of temperature, for example, on hardware reliability in large-scale systems is smaller than often assumed.
This finding implies that the operators of these facilities can achieve high energy savings by raising their operating temperatures, without significant sacrifices in system reliability. Our analysis of power problems in large HPC facilities, on the other hand, revealed strong correlations between different power issues (e.g. power outages, voltage spikes, etc.) and increased failure rates in various hardware and software components. Based on our observations, we derive lessons learned and practical recommendations for the efficient design and operation of large-scale systems. The second part of this thesis utilizes the knowledge obtained from our HPC failure analysis to improve HPC fault-tolerance techniques. We focus on the most widely used fault-tolerance mechanism in modern HPC systems: "checkpoint/restart". We study how to optimize checkpoint scheduling in parallel applications for both performance and energy-efficiency purposes. Our results show that exploiting certain failure characteristics of HPC systems in designing checkpoint-scheduling policies can significantly reduce the energy and performance overheads associated with faults and fault tolerance in HPC systems. (Ph.D. thesis)
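    The checkpoint-scheduling tradeoff described above is commonly framed around Young's first-order approximation of the optimal checkpoint interval, which balances checkpoint overhead against lost work on failure. The sketch below illustrates that standard formula only; it is not the thesis's own scheduling policy, and the function name and example numbers are illustrative:

    ```python
    import math

    def young_interval(ckpt_cost_s: float, mtbf_s: float) -> float:
        """Young's approximation: optimal checkpoint interval = sqrt(2 * C * MTBF),
        where C is the time to write one checkpoint and MTBF is the system's
        mean time between failures (both in seconds)."""
        return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)

    # e.g. a 5-minute checkpoint on a system with a 24-hour MTBF
    print(young_interval(300, 24 * 3600))  # → 7200.0, i.e. checkpoint every 2 hours
    ```

    Failure-aware policies like those studied in the thesis go further by exploiting observed failure characteristics (e.g. temporal correlation) instead of assuming a constant failure rate.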

    Understanding object-level memory access patterns across the spectrum

    No full text
    Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts will benefit from an in-depth understanding of memory access behavior, which is not offered by extant access tracing and profiling methods. In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We have developed a two-pass tool adopting fast online and slow offline profiling, with which we have profiled, at the variable/object level, a collection of 38 representative applications spanning major domains (HPC, personal computing, data analytics, AI, graph processing, and datacenter workloads), at varying problem sizes. We have performed detailed result analysis and code examination. Our findings provide new insights into application memory behavior, including insights on per-object access patterns, adoption of data structures, and memory-access changes at different problem sizes. We find that scientific computation applications exhibit distinct behaviors compared to datacenter workloads, motivating separate memory system design/optimizations. Funding: National Key Basic Research Program of China (Grant 2016YFA0602100); National Science Foundation (China) (Grant No. 91530323); National Research Foundation of Korea (Grant 2015R1C1A1A0152105); United States Department of Energy (Contract DE-AC05-00OR22725).
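    As a rough illustration of what object-level access profiling records (not the paper's two-pass tool, which instruments real loads and stores), a toy profiler can tally reads and writes per labelled object; the class and method names are invented for this sketch:

    ```python
    from collections import defaultdict

    class AccessProfiler:
        """Toy per-object access profiler: counts accesses by object label."""
        def __init__(self):
            self.counts = defaultdict(lambda: [0, 0])  # label -> [reads, writes]

        def read(self, label, value):
            """Record one read of the labelled object and pass the value through."""
            self.counts[label][0] += 1
            return value

        def write(self, label, value):
            """Record one write of the labelled object and pass the value through."""
            self.counts[label][1] += 1
            return value

    prof = AccessProfiler()
    data = [1, 2, 3]
    total = sum(prof.read("data", data))   # counted as one read of `data`
    data = prof.write("data", data + [4])  # counted as one write
    print(prof.counts["data"])             # → [1, 1]
    ```

    A real tool of the kind the paper describes works below source level, attributing individual memory instructions back to the variables and heap objects they touch.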

    Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies

    No full text

    Entropy Optimization of Social Networks Using an Evolutionary Algorithm

    No full text
    Recent work on social networks has tackled the measurement and optimization of these networks' robustness and resilience to both failures and attacks. Different metrics have been used to quantitatively measure the robustness of a social network. In this work, we design and apply a Genetic Algorithm that maximizes the cyclic entropy of a social network model, hence optimizing its robustness to failures. Our social network model is a scale-free network created using Barabási and Albert's generative model, since it has been demonstrated recently that many large complex networks display a scale-free structure. We compare the cycles distribution of the optimally robust network generated by our algorithm to that belonging to a fully connected network. Moreover, we optimize the robustness of a scale-free network based on the links-degree entropy, and compare the outcomes to that which is based on cycles-entropy. We show that both cyclic and degree entropy optimization are equivalent and provide the same final optimal distribution. Hence, cyclic entropy optimization is justified in the search for the optimal network distribution.
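    A minimal sketch of two building blocks mentioned above: a Barabási–Albert generator and the degree-entropy fitness that a genetic algorithm would maximize. The GA loop itself (selection, crossover, rewiring mutations) is omitted, and the function names and sampling details are this sketch's assumptions, not the paper's implementation:

    ```python
    import math
    import random
    from collections import Counter

    def barabasi_albert(n, m, seed=0):
        """Minimal Barabási–Albert graph: each new node attaches to m existing
        nodes chosen preferentially by degree. Returns an undirected edge set."""
        rng = random.Random(seed)
        repeated = []                 # each node appears once per edge endpoint
        edges = set()
        targets = list(range(m))      # the first new node links to the m seed nodes
        for src in range(m, n):
            for t in targets:
                edges.add((min(src, t), max(src, t)))
            repeated.extend(targets + [src] * m)
            chosen = set()
            while len(chosen) < m:    # sample m distinct targets, degree-weighted
                chosen.add(rng.choice(repeated))
            targets = list(chosen)
        return edges

    def degree_entropy(edges):
        """Shannon entropy of the degree distribution: the kind of fitness a GA
        maximizing degree entropy would evaluate per candidate network."""
        deg = Counter()
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        n = len(deg)
        return -sum(c / n * math.log2(c / n)
                    for c in Counter(deg.values()).values())
    ```

    A 5-node ring, where every node has degree 2, has entropy 0, while a scale-free network's mix of hubs and leaves gives a strictly positive value; the GA searches for the degree (or cycle) distribution that maximizes it.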

    Temperature Management in Data Centers: Why Some (Might) Like It Hot

    No full text
    The energy consumed by data centers is starting to make up a significant fraction of the world's energy consumption and carbon emissions. A large fraction of the consumed energy is spent on data center cooling, which has motivated a large body of work on temperature management in data centers. Interestingly, a key aspect of temperature management has not been well understood: controlling the setpoint temperature at which to run a data center's cooling system. Most data centers set their thermostat based on (conservative) suggestions by manufacturers, as there is limited understanding of how higher temperatures will affect the system. At the same time, studies suggest that increasing the temperature setpoint by just one degree could save 2–5% of the energy consumption. This paper provides a multi-faceted study of temperature management in data centers. We use a large collection of field data from different production environments to study the impact of temperature on hardware reliability, including the reliability of the storage subsystem, the memory subsystem, and server reliability as a whole. We also use an experimental testbed based on a thermal chamber and a large array of benchmarks to study two other potential issues with higher data center temperatures: the effect on server performance and power. Based on our findings, we make recommendations for temperature management in data centers that create the potential for saving energy while limiting negative effects on system reliability and performance.

    KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores

    No full text
    © 2018 IEEE. Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management. We present KPart, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems. KPart first groups applications into clusters, then partitions the cache among these clusters. To build clusters, KPart relies on a novel technique to estimate the performance loss an application suffers when sharing a partition. KPart automatically chooses the number of clusters, balancing the isolation benefits of way-partitioning with its potential performance impact. KPart uses detailed profiling information to make these decisions. This information can be gathered either offline, or online at low overhead using a novel profiling mechanism. We evaluate KPart in a real system and in simulation. KPart improves throughput by 24% on average (up to 79%) on an Intel Broadwell-D system, whereas prior per-application partitioning policies improve throughput by just 1.7% on average and hurt 30% of workloads. Simulation results show that KPart achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.
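    Partitioning a cache among clusters is often done with a greedy marginal-gain loop over profiled miss curves, in the spirit of classic utility-based cache partitioning. The sketch below shows that generic idea, not KPart's actual algorithm (which also clusters applications first and handles sharing within partitions); the data layout is an assumption of this sketch:

    ```python
    def partition_ways(miss_curves, total_ways):
        """Greedy way allocation: miss_curves[i][w] is cluster i's miss count
        when given w cache ways (index 0 = no allocation). Each step grants
        one more way to the cluster whose miss count would drop the most."""
        n = len(miss_curves)
        alloc = [1] * n                      # start with one way per cluster
        for _ in range(total_ways - n):
            gains = []
            for i, curve in enumerate(miss_curves):
                a = alloc[i]
                # marginal misses saved by one extra way (0 if curve exhausted)
                gains.append(curve[a] - curve[a + 1] if a + 1 < len(curve) else 0.0)
            alloc[max(range(n), key=gains.__getitem__)] += 1
        return alloc

    steep = [100, 50, 25, 12, 6, 3, 1.5, 0.75, 0.4]   # cache-sensitive cluster
    flat = [100, 99, 98, 97, 96, 95, 94, 93, 92]      # cache-insensitive cluster
    print(partition_ways([steep, flat], 8))            # → [6, 2]
    ```

    The cache-sensitive cluster receives most of the ways because its marginal gain per way stays higher for longer, which is exactly the behavior that makes per-cluster (rather than per-application) partitioning attractive when ways are scarce.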

    Optimising e-Government Data Centre Operations to Minimise Energy Consumption: A Simulation-Based Analytical Approach

    Get PDF
    Part 3: Open Data: Social and Technical Aspects. The energy consumption of data centres continues to increase. However, few decision support systems exist for data centre practitioners to use in their daily operations, including simulating new server installations and forecasting power consumption for target periods. We propose a simulation model based on CloudSim, which is widely used for data centre research. Our simulation model will be tested with datasets including IT workloads, cooling performance, and power consumption of servers and cooling devices. In the final stage, we provide a decision support system to monitor, forecast, and optimise the power consumption of a data centre easily on the web.
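    Power-consumption forecasts in CloudSim-style simulations are often built on a linear utilisation-based server power model plus a facility overhead factor. The sketch below shows that common modelling pattern; the function names and parameter values are hypothetical, not taken from the paper:

    ```python
    def server_power(util, p_idle=100.0, p_max=250.0):
        """Linear power model: idle draw plus utilisation-proportional dynamic
        power, with util in [0, 1] and powers in watts."""
        return p_idle + (p_max - p_idle) * util

    def cooling_power(it_power, pue=1.5):
        """Facility overhead (cooling, distribution) implied by a Power Usage
        Effectiveness factor: overhead = IT power * (PUE - 1)."""
        return it_power * (pue - 1.0)

    it = server_power(0.5)            # one server at 50% utilisation
    print(it, cooling_power(it))      # → 175.0 87.5
    ```

    Summing such per-server estimates over a simulated workload trace, and adding the cooling term, gives the kind of aggregate forecast a decision support system can plot over a target period.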