    Resource management in heterogeneous computing systems with tasks of varying importance

    2014 Summer. The problem of efficiently assigning tasks to machines in heterogeneous computing environments, where different tasks can have different levels of importance (or value) to the computing system, is a challenging one. The goal of this work is to study this problem in a variety of environments. One part of the study considers a computing system and its corresponding workload based on the expectations for future environments of Department of Energy and Department of Defense interest. We design heuristics to maximize a performance metric created using utility functions. We also create a framework to analyze the trade-offs between performance and energy consumption. We design techniques to maximize performance in a dynamic environment that has a constraint on the energy consumption. Another part of the study explores environments that have uncertainty in the availability of the compute resources. For this part, we design heuristics and compare their performance in different types of environments.
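    To make the utility-driven assignment idea concrete, below is a minimal sketch of a greedy heuristic that maps each task to the machine maximizing the utility it would earn. The exponentially decaying utility curve, the ETC (estimated time to compute) table, and all names are illustrative assumptions, not the dissertation's actual heuristics.

```python
import math

# Illustrative sketch only: greedy utility-maximizing assignment of
# tasks to heterogeneous machines. The utility curve and all names are
# assumptions for exposition, not the dissertation's actual heuristics.

def utility(importance, completion_time, decay=0.05):
    """Toy monotonically decreasing utility: the value a task earns
    shrinks the later it completes."""
    return importance * math.exp(-decay * completion_time)

def greedy_assign(tasks, etc):
    """tasks: list of (task_id, importance); etc[tid][m]: estimated
    time to compute task tid on machine m."""
    num_machines = len(next(iter(etc.values())))
    ready = [0.0] * num_machines          # earliest free time per machine
    schedule = {}
    for tid, imp in sorted(tasks, key=lambda t: -t[1]):  # important first
        best = max(range(num_machines),
                   key=lambda m: utility(imp, ready[m] + etc[tid][m]))
        schedule[tid] = best
        ready[best] += etc[tid][best]
    return schedule

# Two machines; the more important task "b" claims its faster machine.
print(greedy_assign([("a", 1.0), ("b", 5.0)],
                    {"a": [4.0, 2.0], "b": [3.0, 6.0]}))  # {'b': 0, 'a': 1}
```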

    Resource management for extreme scale high performance computing systems in the presence of failures

    2018 Summer. High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of being forced to share resources; in particular, contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through (a) the optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for prediction. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects of system failures as well as application co-location on large-scale HPC systems. Our analysis of application and system behavior also investigates: the interrelated effects of application network usage and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems.
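    As one concrete point of reference for checkpoint interval selection, Young's classic first-order approximation relates the optimal interval to the checkpoint cost and the system's mean time between failures (MTBF). The sketch below applies that well-known formula under assumed parameter values; it is a baseline illustration, not the dissertation's own model.

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint
    interval: tau ~= sqrt(2 * C * MTBF), where C is the time to take
    one checkpoint and MTBF is the mean time between failures."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Assumed example: a 60 s checkpoint on a system with a 4-hour MTBF.
tau = young_interval(60.0, 4 * 3600.0)
print(f"checkpoint every {tau / 60.0:.1f} minutes")  # ~21.9 minutes
```

    A larger checkpoint cost or a shorter MTBF pulls the optimal interval down, which is the sensitivity to system parameters that the dissertation examines across different checkpoint-based protocols.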

    Resource management for heterogeneous computing systems: utility maximization, energy-aware scheduling, and multi-objective optimization

    2015 Summer. As high performance heterogeneous computing systems continually become faster, the cost to operate these systems has increased. A significant portion of the operating costs can be attributed to the amount of energy required for these systems to operate. To reduce these costs, it is important for system administrators to operate these systems in an energy-efficient manner. Additionally, it is important to be able to measure the performance of a given system so that the impacts of operating at different levels of energy efficiency can be analyzed. The goal of this research is to examine how energy and system performance interact with each other in a variety of environments. One part of this study considers a computing system and its corresponding workload based on the expectations for future environments of Department of Energy and Department of Defense interest. Numerous heuristics are presented that maximize a performance metric created using utility functions. Additional heuristics and energy filtering techniques have been designed for a computing system that has the goal of maximizing the total utility earned while being subject to an energy constraint. A framework has been established to analyze the trade-offs between performance (utility earned) and energy consumption. Stochastic models are used to create "fuzzy" Pareto fronts to analyze the variability of solutions along the Pareto front when uncertainties in execution time and power consumption are present within a system. In addition to using utility earned as a measure of system performance, system makespan has also been studied. Finally, a framework has been developed that enables the investigation of the effects of P-states and memory interference on energy consumption and system performance.
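    A bi-objective trade-off analysis of this kind typically starts from the set of non-dominated (utility, energy) solutions. The following is a minimal sketch of that filtering step over assumed toy data; the dissertation's "fuzzy" fronts could, in the same spirit, come from recomputing such a front over many stochastic samples of execution time and power, though the actual method may differ.

```python
def pareto_front(solutions):
    """Return the non-dominated (utility, energy) points: a solution
    is dominated if another earns at least as much utility with no
    more energy, and is strictly better in at least one objective.
    solutions: list of (utility_earned, energy_consumed) tuples."""
    front = []
    for u, e in solutions:
        dominated = any(u2 >= u and e2 <= e and (u2 > u or e2 < e)
                        for u2, e2 in solutions)
        if not dominated:
            front.append((u, e))
    # Sort by energy so the trade-off curve reads left to right.
    return sorted(front, key=lambda p: p[1])

# Assumed toy data: (70, 35) is dominated by (80, 30) and drops out.
points = [(90, 50), (80, 30), (70, 35), (95, 80), (60, 20)]
print(pareto_front(points))  # [(60, 20), (80, 30), (90, 50), (95, 80)]
```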

    Towards an open cloud marketplace: vision and first steps

    As one of the most promising emerging concepts in Information Technology (IT), cloud computing is transforming how IT is consumed and managed, yielding improved cost efficiencies and delivering flexible, on-demand scalability by reducing computing infrastructures, platforms, and services to commodities acquired and paid for on demand through a set of cloud providers. Today, the transition of cloud computing from a subject of research and innovation to a critical infrastructure is proceeding at an incredibly fast pace. A potentially dangerous consequence of this speedy transition to practice is the premature adoption, and ossification, of the models, technologies, and standards underlying this critical infrastructure. This state of affairs is exacerbated by the fact that innovative research on production-scale platforms is becoming the purview of a small number of public cloud providers. Specifically, the academic research communities are effectively excluded from the opportunity to contribute meaningfully to the evolution, not to mention the innovation and healthy mutation, of cloud computing technologies. As the dependence of our society and economy on cloud computing increases, so does the realization that the academic research community cannot be shut out from contributing to the design and evolution of this critical infrastructure. In this article we provide an alternative vision: that of an Open Cloud eXchange (OCX), a public cloud marketplace where many stakeholders, rather than just a single cloud provider, participate in implementing and operating the cloud, thus creating an ecosystem that will bring the innovation of a broader community to bear on a much healthier and more efficient cloud marketplace.

    Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization

    Load imbalance in parallel systems can be generated by factors external to the currently running applications, such as operating system noise, or by the underlying hardware, such as a heterogeneous cluster. HPC applications working on irregular data structures can also have difficulty balancing their computations across the parallel tasks. In this article we extend, improve, and more deeply evaluate the Task Packing mechanism proposed in previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are loaded with only the useful work of the parallel application tasks, provided performance is not degraded. The packing is solved by an algorithm based on the Knapsack problem, using a minimum number of CPUs and oversubscription. We design and implement a more efficient version of this mechanism: we perform the Task Packing "in place", taking advantage of idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using the FT and miniFE benchmarks. Results show that our proposal generates low overhead. In addition, the number of freed CPUs is related to a load imbalance metric, which can be used as a predictor for it.
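    To illustrate the packing step, here is a minimal sketch that packs per-task CPU utilizations onto as few CPUs as possible so the remainder can be freed. It uses a simple first-fit-decreasing bin-packing heuristic as a stand-in for the article's Knapsack-based algorithm; the capacity model and all numbers are assumptions.

```python
def pack_tasks(utilizations, capacity=1.0):
    """First-fit-decreasing packing of per-task CPU utilizations
    (the fraction of a CPU each task keeps busy between sync points)
    onto as few CPUs as possible; every CPU left unused is freed.
    Simplified stand-in for the article's Knapsack-based solver."""
    cpus = []  # remaining capacity of each CPU already in use
    for u in sorted(utilizations, reverse=True):
        for i, free in enumerate(cpus):
            if u <= free:
                cpus[i] -= u           # oversubscribe an existing CPU
                break
        else:
            cpus.append(capacity - u)  # open a new CPU
    return len(cpus)

# Assumed toy workload: four tasks at 60% and four at 30% utilization
# pack onto 4 CPUs instead of occupying 8, freeing the other 4.
print(pack_tasks([0.6, 0.6, 0.6, 0.6, 0.3, 0.3, 0.3, 0.3]))  # 4
```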

    Value and energy aware adaptive resource allocation of soft real-time jobs on many-core HPC data centers

    Modern high performance computing (HPC) data centers consume a huge amount of energy to operate. Therefore, appropriate measures are required to reduce their energy consumption. Existing efforts toward such measures focus on consolidation and dynamic voltage and frequency scaling (DVFS). However, most of them do not perform adaptive resource allocation for the executing dependent tasks (or jobs) in order to optimize both value and energy. Value is earned by completing the execution of a job and depends on the completion time: a high value is achieved if the job is completed before its deadline, and a lower value otherwise. In this paper, we propose an adaptive resource allocation approach that uses design-time profiling results of jobs for efficient allocation and adaptation in order to optimize both value and energy while executing dependent tasks. The profiling results for each job are obtained by combining efficient allocation with identification of the voltage/frequency levels of the system cores used; they are then used to adapt the number of allocated cores based on the job's monitored execution progress and the cores available. Experiments show that the proposed approach enhances the overall value by about 10% when compared to existing approaches, while also reducing energy consumption and the percentage of rejected jobs that lead to zero value.
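    The completion-time-dependent value described above can be pictured as a simple curve: full value up to the deadline, then a decay to zero. The sketch below encodes one plausible such curve; the shape, the grace window, and all numbers are assumptions for illustration, not the paper's actual value function.

```python
def job_value(max_value, completion_time, deadline, grace=0.5):
    """Toy soft real-time value curve: full value when the job meets
    its deadline, then a linear decay to zero over a grace window
    (here 50% of the deadline), after which the job earns nothing.
    Illustrative assumption, not the paper's actual function."""
    if completion_time <= deadline:
        return max_value
    cutoff = deadline * (1.0 + grace)
    if completion_time >= cutoff:
        return 0.0
    return max_value * (cutoff - completion_time) / (cutoff - deadline)

print(job_value(100, 90, 100))   # 100.0 (met deadline: full value)
print(job_value(100, 120, 100))  # 60.0  (late: partial value)
print(job_value(100, 160, 100))  # 0.0   (too late: zero value)
```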