Fibers are not (P)Threads: The Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations
Asynchronous programming models (APM) are gaining more and more traction,
allowing applications to expose the available concurrency to a runtime system
tasked with coordinating the execution. While MPI has long provided support for
multi-threaded communication and non-blocking operations, it falls short of
adequately supporting APMs, as correctly and efficiently handling MPI
communication in different models is still a challenge. Meanwhile, new
low-level implementations of light-weight, cooperatively scheduled execution
contexts (fibers, aka user-level threads (ULT)) are meant to serve as a basis
for higher-level APMs and their integration in MPI implementations has been
proposed as a replacement for traditional POSIX thread support to alleviate
these challenges.
In this paper, we first establish a taxonomy in an attempt to clearly
distinguish different concepts in the parallel software stack. We argue that
the proposed tight integration of fiber implementations with MPI is neither
warranted nor beneficial and instead is detrimental to the goal of MPI being a
portable communication abstraction. We propose MPI Continuations as an
extension to the MPI standard to provide callback-based notifications on
completed operations, leading to a clear separation of concerns by providing a
loose coupling mechanism between MPI and APMs. We show that this interface is
flexible and interacts well with different APMs, namely OpenMP detached tasks,
OmpSs-2, and Argobots.
Comment: 12 pages, 7 figures. Published in proceedings of EuroMPI/USA '20,
September 21-24, 2020, Austin, TX, US
Extending the Functionality of Score-P through Plugins: Interfaces and Use Cases
Performance measurement and runtime tuning tools are both vital in the HPC software ecosystem and use similar techniques: the analyzed application is interrupted at specific events and information on the current system state is gathered to be either recorded or used for tuning. One of the established performance measurement tools is Score-P. It supports numerous HPC platforms and parallel programming paradigms. To extend Score-P with support for different back-ends, to create a common framework for measurement and tuning of HPC applications, and to enable the re-use of common software components such as implemented instrumentation techniques, this paper makes the following contributions: (I) We describe the Score-P metric plugin interface, which enables programmers to augment the event stream with metric data from supplementary data sources that are otherwise not accessible to Score-P. (II) We introduce the flexible Score-P substrate plugin interface that can be used for custom processing of the event stream according to the specific requirements of either measurement, analysis, or runtime tuning tasks. (III) We provide examples for both interfaces that extend Score-P’s functionality for monitoring and tuning purposes.
MPI Application Binary Interface Standardization
MPI is the most widely used interface for high-performance computing (HPC)
workloads. Its success lies in its embrace of libraries and ability to evolve
while maintaining backward compatibility for older codes, enabling them to run
on new architectures for many years. In this paper, we propose a new level of
MPI compatibility: a standard Application Binary Interface (ABI). We review the
history of MPI implementation ABIs, identify the constraints from the MPI
standard and ISO C, and summarize recent efforts to develop a standard ABI for
MPI. We provide the current proposal from the MPI Forum's ABI working group,
which has been prototyped both within MPICH and as an independent abstraction
layer called Mukautuva. We also list several use cases that would benefit from
the definition of an ABI while outlining the remaining constraints.
Globalization, Regional Integration and Economic Development
The paper evaluates the energy consumption of solvers of linear systems based on Finite Element Tearing and Interconnect (FETI), an established method for solving real-world engineering problems. The authors evaluated the effect of the CPU frequency on the energy consumption of the FETI solver using a synthetic 3D cube linear elasticity benchmark. In this problem, the effect of frequency tuning on the energy consumption of the essential processing kernels of the FETI method was evaluated. The paper provides results for two types of frequency tuning: (1) static tuning and (2) dynamic tuning. For static tuning experiments, the frequency is set before execution and kept constant during the runtime. For dynamic tuning, the frequency is changed during the program execution to adapt the system to the actual needs of the application. The paper shows that static tuning brings up to 12% energy savings when compared to default CPU settings (the highest clock rate). Dynamic tuning improves this further by up to 3%.
The EU Center of Excellence for Exascale in Solid Earth (ChEESE): Implementation, results, and roadmap for the second phase
Scalable Tools for Non-Intrusive Performance Debugging of Parallel Linux Workloads
There is a variety of tools to measure the performance of Linux systems and the applications running on them. However, the resulting performance data is often presented in plain text format or only with a very basic user interface. For large systems with many cores and concurrent threads, it is increasingly difficult to present the data in a clear way for analysis. Moreover, certain performance analysis and debugging tasks require the use of a high-resolution, timeline-based approach, again entailing data visualization challenges. Tools in the area of High Performance Computing (HPC) have long been able to scale to hundreds or thousands of parallel threads and help find performance anomalies. We therefore present a solution to gather performance data using Linux performance monitoring interfaces. A combination of sampling and careful instrumentation allows us to obtain detailed performance traces with manageable overhead. We then convert the resulting output to the Open Trace Format (OTF) to bridge the gap between the recording infrastructure and HPC analysis tools. We explore ways to visualize the data by using the graphical tool Vampir. The combination of established Linux and HPC tools allows us to create an interface for easy navigation through time-ordered performance data grouped by thread or CPU and to help users find opportunities for performance optimizations.
Run-Time Exploitation of Application Dynamism for Energy-Efficient Exascale Computing (READEX)
Efficiently utilizing the resources provided on current petascale and future exascale systems will be a challenging task, potentially causing a large amount of underutilized resources and wasted energy. A promising potential to improve efficiency of HPC applications stems from the significant degree of dynamic behavior, e.g., run-time alternation in application resource requirements in HPC workloads. Manually detecting and leveraging this dynamism to improve performance and energy-efficiency is a tedious task that is commonly neglected by developers. However, using an automatic optimization approach, application dynamism can be analyzed at design-time and used to optimize system configurations at run-time.
The European Union Horizon 2020 READEX project will develop a tools-aided, scenario-based auto-tuning methodology to exploit the dynamic behavior of HPC applications to achieve improved energy-efficiency and performance. Driven by a consortium of European experts from academia, HPC resource providers, and industry, the READEX project aims at developing the first-of-its-kind generic framework for split design-time and run-time automatic tuning for heterogeneous systems at the exascale level.
READEX: Linking Two Ends of the Computing Continuum to Improve Energy-efficiency in Dynamic Applications
In both the embedded systems and High Performance Computing domains, energy-efficiency has become one of the main design criteria. Efficiently utilizing the resources provided in computing systems ranging from embedded systems to current petascale and future Exascale HPC systems will be a challenging task. Suboptimal designs can potentially cause large amounts of underutilized resources and wasted energy. In both domains, a promising potential for improving efficiency of scalable applications stems from the significant degree of dynamic behaviour, e.g., runtime alternation in application resource requirements and workloads. Manually detecting and leveraging this dynamism to improve performance and energy-efficiency is a tedious task that is commonly neglected by developers. However, using an automatic optimization approach, application dynamism can be analysed at design time and used to optimize system configurations at runtime. The European Union Horizon 2020 READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) project will develop a tools-aided auto-tuning methodology inspired by the system scenario methodology used in embedded systems. Dynamic behaviour of HPC applications will be exploited to achieve improved energy-efficiency and performance. Driven by a consortium of European experts from academia, HPC resource providers, and industry, the READEX project aims at developing the first-of-its-kind generic framework to split design time and runtime automatic tuning while targeting heterogeneous systems at the Exascale level. This paper describes plans for the project as well as early results achieved during its first year. Furthermore, it is shown how project results will be brought back into the embedded systems domain.