Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems
Current HPC systems provide memory resources that are statically configured
and tightly coupled with compute nodes. However, workloads on HPC systems are
evolving. Diverse workloads lead to a need for configurable memory resources to
achieve high performance and utilization. In this study, we evaluate a memory
subsystem design leveraging CXL-enabled memory pooling. Two promising use cases
of composable memory subsystems are studied -- fine-grained capacity
provisioning and scalable bandwidth provisioning. We developed an emulator to
explore the performance impact of various memory compositions. We also provide
a profiler to identify the memory usage patterns in applications and their
optimization opportunities. Seven scientific and six graph applications are
evaluated on various emulated memory configurations. Three out of seven
scientific applications had less than 10% performance impact when the pooled
memory backed 75% of their memory footprint. The results also show that a
dynamically configured high-bandwidth system can effectively support
bandwidth-intensive unstructured mesh-based applications like OpenFOAM.
Finally, we identify interference through shared memory pools as a practical
challenge for adoption on HPC systems.
Comment: 10 pages, 13 figures. Accepted for publication in Workshop on Memory
Centric High Performance Computing (MCHPC'22) at SC2
Power efficient job scheduling by predicting the impact of processor manufacturing variability
Modern CPUs suffer from performance and power consumption variability due to the manufacturing process. As a result, systems that do not account for this variability suffer performance degradation and wasted power. To avoid this negative impact, users and system administrators must actively counteract manufacturing variability.
In this work we show that parallel systems benefit from taking into account the consequences of manufacturing variability when making scheduling decisions at the job scheduler level. We also show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications, utilizing up to 4096 cores in total. We demonstrate that they decrease job turnaround time by up to 31% compared to contemporary scheduling policies used on production clusters, while saving up to 5.5% energy.
Postprint (author's final draft
A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems
Memory disaggregation has recently been adopted in data centers to improve
resource utilization, motivated by cost and sustainability. Recent studies on
large-scale HPC facilities have also highlighted memory underutilization. A
promising and non-disruptive option for memory disaggregation is rack-scale
memory pooling, where shared memory pools supplement node-local memory. This
work outlines the prospects and requirements for adoption and clarifies several
misconceptions. We propose a quantitative method for dissecting application
requirements on the memory system from the top down in three levels, moving
from general, to multi-tier memory systems, and then to memory pooling. We
provide a multi-level profiling tool and LBench to facilitate the quantitative
approach. We evaluate a set of representative HPC workloads on an emulated
platform. Our results show that prefetching activities can significantly
influence memory traffic profiles. Interference in memory pooling has varied
impacts on applications, depending on their access ratios to memory tiers and
arithmetic intensities. Finally, in two case studies, we show the benefits of
our findings at the application and system levels, achieving 50% reduction in
remote access and 13% speedup in BFS, and reducing performance variation of
co-located workloads in interference-aware job scheduling.
Comment: Accepted to SC23 (The International Conference for High Performance
Computing, Networking, Storage, and Analysis 2023
Scheduling Heterogeneous HPC Applications in Next-Generation Exascale Systems
Next-generation HPC applications will increasingly time-share system resources with emerging workloads such as in-situ analytics, resilience tasks, runtime adaptation services and power management activities. HPC systems must carefully schedule these co-located codes in order to reduce their impact on application performance. Among the techniques traditionally used to mitigate the performance effects of time-shared systems is gang scheduling. This approach, however, relies on global synchronization and time agreement mechanisms that will become hard to support as systems increase in size. Alternative performance interference mitigation approaches must be explored for future HPC systems. This dissertation evaluates the impacts of workload concurrency in future HPC systems. It uses simulation and modeling techniques to study the performance impacts of existing and emerging interference sources on a selection of HPC benchmarks, mini-applications, and applications. It also quantifies the costs and benefits of different approaches to scheduling co-located workloads, studies performance interference mitigation solutions based on gang scheduling, and examines their synchronization requirements. To do so, this dissertation presents and leverages a new Extreme Value Theory-based model to characterize interference sources and investigate their impact on Bulk Synchronous Parallel (BSP) applications. It demonstrates how this model can be used to analyze the interference attenuation effects of alternative fine-grained OS scheduling approaches based on periodic real-time schedulers. This analysis can, in turn, guide the design of those mitigation techniques by providing tools to understand the tradeoffs of selecting scheduling parameters
High-Performance and Time-Predictable Embedded Computing
Nowadays, computing systems are so prevalent in our lives that we live in a cyber-physical world dominated by computer systems, from pacemakers to cars and airplanes. These systems demand more computational performance to process large amounts of data from multiple data sources with guaranteed processing times. Acting outside the required timing bounds may cause the system to fail, which is critical for systems such as planes, cars, business monitoring, and e-trading.
High-Performance and Time-Predictable Embedded Computing presents recent advances in software architecture and tools to support such complex systems, enabling the design of embedded computing devices that deliver high performance whilst guaranteeing the application's required timing bounds.
Technical topics discussed in the book include parallel embedded platforms; programming models; mapping and scheduling of parallel computations; timing and schedulability analysis; and runtimes and operating systems.
The work reflected in this book was done in the scope of the European project P-SOCRATES, funded under the FP7 framework program of the European Commission. High-Performance and Time-Predictable Embedded Computing is ideal for personnel in the computer, communication, and embedded industries, as well as academic staff and master's and research students in computer science, embedded systems, cyber-physical systems and internet-of-things.
Scalability in the Presence of Variability
Supercomputers are used to solve some of the world’s most computationally demanding
problems. Exascale systems, to be comprised of over one million cores and capable of 10^18
floating point operations per second, will probably exist by the early 2020s, and will provide
unprecedented computational power for parallel computing workloads. Unfortunately,
while these machines hold tremendous promise and opportunity for applications in High
Performance Computing (HPC), graph processing, and machine learning, it will be a major
challenge to fully realize their potential, because to do so requires balanced execution across
the entire system and its millions of processing elements. When different processors take different
amounts of time to perform the same amount of work, performance imbalance arises,
large portions of the system sit idle, and time and energy are wasted. Larger systems incorporate
more processors and thus greater opportunity for imbalance to arise, as well as larger
performance/energy penalties when it does. This phenomenon is referred to as performance
variability and is the focus of this dissertation.
In this dissertation, we explain how to design system software to mitigate variability
on large scale parallel machines. Our approaches span (1) the design, implementation, and
evaluation of a new high performance operating system to reduce some classes of performance
variability, (2) a new performance evaluation framework to holistically characterize
key features of variability on new and emerging architectures, and (3) a distributed modeling
framework that derives predictions of how and where imbalance is manifesting in order to
drive reactive operations such as load balancing and speed scaling. Collectively, these efforts
provide a holistic set of tools to promote scalability through the mitigation of variability
- …