355 research outputs found

    Resource management in heterogeneous computing systems with tasks of varying importance

    Get PDF
    2014 Summer.The problem of efficiently assigning tasks to machines in heterogeneous computing environments where different tasks can have different levels of importance (or value) to the computing system is a challenging one. The goal of this work is to study this problem in a variety of environments. One part of the study considers a computing system and its corresponding workload based on the expectations for future environments of Department of Energy and Department of Defense interest. We design heuristics to maximize a performance metric created using utility functions. We also create a framework to analyze the trade-offs between performance and energy consumption. We design techniques to maximize performance in a dynamic environment that has a constraint on the energy consumption. Another part of the study explores environments that have uncertainty in the availability of the compute resources. For this part, we design heuristics and compare their performance in different types of environments

    Resource management for heterogeneous computing systems: utility maximization, energy-aware scheduling, and multi-objective optimization

    Get PDF
    Includes bibliographical references.2015 Summer.As high performance heterogeneous computing systems continually become faster, the operating cost to run these systems has increased. A significant portion of the operating costs can be attributed to the amount of energy required for these systems to operate. To reduce these costs it is important for system administrators to operate these systems in an energy efficient manner. Additionally, it is important to be able to measure the performance of a given system so that the impacts of operating at different levels of energy efficiency can be analyzed. The goal of this research is to examine how energy and system performance interact with each other for a variety of environments. One part of this study considers a computing system and its corresponding workload based on the expectations for future environments of Department of Energy and Department of Defense interest. Numerous Heuristics are presented that maximize a performance metric created using utility functions. Additional heuristics and energy filtering techniques have been designed for a computing system that has the goal of maximizing the total utility earned while being subject to an energy constraint. A framework has been established to analyze the trade-offs between performance (utility earned) and energy consumption. Stochastic models are used to create "fuzzy" Pareto fronts to analyze the variability of solutions along the Pareto front when uncertainties in execution time and power consumption are present within a system. In addition to using utility earned as a measure of system performance, system makespan has also been studied. Finally, a framework has been developed that enables the investigation of the effects of P-states and memory interference on energy consumption and system performance

    Resource management for extreme scale high performance computing systems in the presence of failures

    Get PDF
    2018 Summer.Includes bibliographical references.High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of applications being forced to share resources, in particular, the contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through the (a) optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects from system failures as well as application co-location on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems
    • …
    corecore