Energy Demand Response for High-Performance Computing Systems
The growing computational demands of scientific applications have greatly motivated the development of large-scale high-performance computing (HPC) systems over the past decade. To accommodate these demands, HPC systems have undergone dramatic architectural changes (e.g., the introduction of multi-core and many-core processors, and the rapid growth of complex interconnection networks for efficient communication among thousands of nodes), as well as significant growth in size (e.g., modern supercomputers consist of hundreds of thousands of nodes). With such changes in architecture and scale, the energy consumption of these systems has increased significantly. With the advent of exascale supercomputers in the next few years, the power consumption of HPC systems will surely increase; some systems may even consume hundreds of megawatts of electricity. Demand response programs are designed to help energy service providers stabilize the power grid by reducing the energy consumption of participating systems during periods of high power demand or temporary shortages in power supply.
This dissertation focuses on developing energy-efficient demand-response models and algorithms to enable HPC systems' participation in demand response. In the first part, we present interconnection network models for performance prediction of large-scale HPC applications. They are based on interconnect topologies widely used in HPC systems: dragonfly, torus, and fat-tree. Our interconnect models are fully integrated with an implementation of the message-passing interface (MPI) that can mimic most of its functions with packet-level accuracy. Extensive experiments show that our integrated models predict network behavior with good accuracy while also achieving good parallel scaling performance. In the second part, we present an energy-efficient demand-response model to reduce HPC systems' energy consumption during demand response periods. We propose HPC job scheduling and resource provisioning schemes that enable HPC systems to participate in emergency demand response. In the final part, we propose an economic demand-response model that allows both the HPC operator and HPC users to jointly reduce the system's energy cost. Our proposed model allows HPC systems to participate in economic demand-response programs through a contract-based rewarding scheme that incentivizes HPC users to take part in demand response.
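The abstract does not specify the scheduling scheme, but the core idea of emergency demand response — shedding enough job power to meet a reduced budget during a demand-response window — can be illustrated with a minimal sketch. All names, job data, and the least-progress-first heuristic below are illustrative assumptions, not the dissertation's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    power_kw: float   # estimated power draw of the job
    progress: float   # fraction of work completed (0.0 - 1.0)

def select_jobs_to_throttle(jobs, system_power_kw, target_power_kw):
    """Greedily pick jobs to suspend until total system draw meets the
    demand-response target, preferring jobs with the least progress so
    that nearly finished work is not wasted (a hypothetical heuristic)."""
    needed = system_power_kw - target_power_kw
    shed, throttled = 0.0, []
    for job in sorted(jobs, key=lambda j: j.progress):
        if shed >= needed:
            break
        throttled.append(job.name)
        shed += job.power_kw
    return throttled, shed

# Hypothetical workload: 470 kW running, DR event asks for 250 kW.
jobs = [Job("climate-sim", 120.0, 0.9),
        Job("cfd-run", 200.0, 0.2),
        Job("genome-asm", 150.0, 0.5)]
names, shed = select_jobs_to_throttle(jobs, system_power_kw=470.0,
                                      target_power_kw=250.0)
print(names, shed)  # ['cfd-run', 'genome-asm'] 350.0
```

A real scheme would also consider per-node power capping (e.g., DVFS) instead of suspension, and the economic model described above would weigh the reward offered against the cost of delayed jobs.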
Scalability in the Presence of Variability
Supercomputers are used to solve some of the world’s most computationally demanding
problems. Exascale systems, comprising over one million cores and capable of 10^18
floating point operations per second, will likely arrive by the early 2020s and will provide
unprecedented computational power for parallel computing workloads. Unfortunately,
while these machines hold tremendous promise and opportunity for applications in High
Performance Computing (HPC), graph processing, and machine learning, it will be a major
challenge to fully realize their potential, because to do so requires balanced execution across
the entire system and its millions of processing elements. When different processors take different
amounts of time to perform the same amount of work, performance imbalance arises,
large portions of the system sit idle, and time and energy are wasted. Larger systems incorporate
more processors and thus greater opportunity for imbalance to arise, as well as larger
performance/energy penalties when it does. This phenomenon is referred to as performance
variability and is the focus of this dissertation.
In this dissertation, we explain how to design system software to mitigate variability
on large scale parallel machines. Our approaches span (1) the design, implementation, and
evaluation of a new high performance operating system to reduce some classes of performance
variability, (2) a new performance evaluation framework to holistically characterize
key features of variability on new and emerging architectures, and (3) a distributed modeling
framework that derives predictions of how and where imbalance is manifesting in order to
drive reactive operations such as load balancing and speed scaling. Collectively, these efforts
provide a holistic set of tools to promote scalability through the mitigation of variability.
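The dissertation's modeling framework predicts where imbalance is manifesting; a common way to quantify such imbalance, sketched below under assumptions of ours (the metric and the numbers are illustrative, not taken from the work), is to compare the slowest rank's time against the mean time for the same amount of work.

```python
def imbalance(times):
    """Percent imbalance across ranks: how much slower the straggler
    is than the average rank (0.0 means perfectly balanced)."""
    mean = sum(times) / len(times)
    return max(times) / mean - 1.0

# Per-rank execution times for identical work (hypothetical numbers):
balanced = [1.0, 1.0, 1.0, 1.0]
skewed   = [1.0, 1.0, 1.0, 2.0]
print(imbalance(balanced))  # 0.0
print(imbalance(skewed))    # ~0.6, i.e. the straggler is 60% slower than average
```

A runtime that tracks this metric online could trigger the reactive operations mentioned above, such as migrating work away from slow ranks (load balancing) or raising their clock frequency (speed scaling), once the imbalance crosses a threshold.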