Search CORE

Oskar Bordeaux

On the Overhead of Topology Discovery for Locality-aware Scheduling in HPC

Author: Goglin Brice
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

International audienceThe increasing complexity of parallel computing platforms requires a deep knowledge of the hardware and of the application needs. Locality a key criteria for performance optimization. It involves software tools to expose information about the hardware topology to high performance runtime libraries. We show that the overhead of gathering such information from the operating system is significant on large computing nodes that run Linux. This overhead also increases more than linearly with the number of processes that perform it simultaneously. We then study the actual needs of the HPC software ecosystem in terms of topology information. We propose some ways to avoid multiple expensive topology discovery and to share topology information between components such as the resource manager or the runtime libraries

Process Affinity, Metrics and Impact on Performance: an Empirical Study

Author: Bordage Cyril
Jeannot Emmanuel
Publication venue: HAL CCSD
Publication date: 01/05/2018
Field of study

International audienceProcess placement, also called topology mapping, is a well-known strategy to improve parallel program execution by reducing the communication cost between processes. It requires two inputs: the topology of the target machine and a measure of the affinity between processes. In the literature, the dominant affinity measure is the communication matrix that describes the amount of communication between processes. The goal of this paper is to study the accuracy of the communication matrix as a measure of affinity. We have done an extensive set of tests with two fat-tree machines and a 3d-torus machine to evaluate several hypotheses that are often made in the literature and to discuss their validity. First, we check the correlation between algorithmic metrics and the performance of the application. Then, we check whether a good generic process placement algorithm never degrades performance. And finally, we see whether the structure of the communication matrix can be used to predict gain. I. INTRODUCTION We are currently seeing a deepening in the hierarchy of high-performance computing system. Nodes are composed of multicore processors with different levels of memory (standard DRAM, non-volatile memory, faster but smaller MCDRAM for KNL, etc.) and the network interconnecting these nodes can also be highly intricate with complex topology and high diameter. The consequence of these architectural features is that the performance of the parallel applications highly depends on the nodes allocated for the job as well as the mapping of these jobs. Process placement (also known as topology mapping) is an active field of research that deals with the development of strategies targeting the improvement of parallel applications by carefully allocating processes onto the resources [14]. The goal is to reduce the communication by mapping close to each other processes that communicate the most. The communication time depends on the algorithm implemented in the application: it depends on the quantity of data to be exchanged. Moreover, since all computing resources are not directly connected, it also depends on the distance between the running processes as well as the speed of the different links. Figure 1 shows what can be the distances (in number of hops) between cores in a fat-tree machine with 6 nodes with 24 cores each (two processors made of two NUMA nodes with 6 cores each). We see clearly blocks of same distances. Hence, it seems natural to put closer two processes that communicate a lot to reduce the communication cost. To this purpose, we need to adapt the execution of parallel applications to the target machine according to its specific topology

Memory Footprint of Locality Information on Many-Core Platforms

Author: Goglin Brice
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 21/05/2018
Field of study

International audienceExploiting the power of HPC platforms requires knowledge of their increasingly complex hardware topologies. Multiple components of the software stack, for instance MPI implementations or OpenMP runtimes, now perform their own topology discovery to find out the available cores and memory, and to better place tasks based on their affinities.We study in this article the impact of this topology discovery in terms of memory footprint. Storing locality information wastes an amount of physical memory that is becoming an issue on many-core platforms on the road to exascale.We demonstrate that this information may be factorized between processes by using a shared-memory region. Our analysis of the physical and virtual memories in supercomputing architectures shows that this shared region can be mapped at the same virtual address in all processes, hence dramatically simplifying the software implementation.Our implementation in hwloc and Open MPI shows a memory footprint that does not increase with the number of MPI ranks per node anymore. Moreover the job launch time is decreased by more than a factor of 2 on an Intel Knights Landing Xeon Phi and on a 96-core NUMA platform

Trends in Data Locality Abstractions for HPC Systems

Author: Amir Kamil
Anshu Dubey
Bradford L. Chamberlain
Chris J. Newburn
Didem Unat
Emmanuel Jeannot
Frank Hannig
H. Carter Edwards
Hal Finkel
Hatem Ltaief
Jeff Keasler
John Shalf
Karl Fuerlinger
Mark Abraham
Mauro Bianco
Miquel Pericas
Naoya Maruyama
Paul H J Kelly
Romain Cledat
Torsten Hoefler
Vitus Leung
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs

Author: Gutiérrez Samuel K.
Publication venue: UNM Digital Repository
Publication date: 01/12/2018
Field of study

Hybrid parallel programming models that combine message passing (MP) and shared- memory multithreading (MT) are becoming more popular, especially with applications requiring higher degrees of parallelism and scalability. Consequently, coupled parallel programs, those built via the integration of independently developed and optimized software libraries linked into a single application, increasingly comprise message-passing libraries with differing preferred degrees of threading, resulting in thread-level heterogeneity. Retroactively matching threading levels between independently developed and maintained libraries is difficult, and the challenge is exacerbated because contemporary middleware services provide only static scheduling policies over entire program executions, necessitating suboptimal, over-subscribed or under-subscribed, configurations. In coupled applications, a poorly configured component can lead to overall poor application performance, suboptimal resource utilization, and increased time-to-solution. So it is critical that each library executes in a manner consistent with its design and tuning for a particular system architecture and workload. Therefore, there is a need for techniques that address dynamic, conflicting configurations in coupled multithreaded message-passing (MT-MP) programs. Our thesis is that we can achieve significant performance improvements over static under-subscribed approaches through reconfigurable execution environments that consider compute phase parallelization strategies along with both hardware and software characteristics. In this work, we present new ways to structure, execute, and analyze coupled MT- MP programs. Our study begins with an examination of contemporary approaches used to accommodate thread-level heterogeneity in coupled MT-MP programs. Here we identify potential inefficiencies in how these programs are structured and executed in the high-performance computing domain. We then present and evaluate a novel approach for accommodating thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application’s execution by providing programmable facilities with modest overheads to dynamically reconfigure runtime environments for compute phases with differing threading factors and affinities. Our performance results show that for a majority of the tested scientific workloads our approach and corresponding open-source reference implementation render speedups greater than 50 % over the static under-subscribed baseline. Motivated by our examination of reconfigurable execution environments and their memory overhead, we also study the memory attribution problem: the inability to predict or evaluate during runtime where the available memory is used across the software stack comprising the application, reusable software libraries, and supporting runtime infrastructure. Specifically, dynamic adaptation requires runtime intervention, which by its nature introduces additional runtime and memory overhead. To better understand the latter, we propose and evaluate a new way to quantify component-level memory usage from unmodified binaries dynamically linked to a message-passing communication library. Our experimental results show that our approach and corresponding implementation accurately measure memory resource usage as a function of time, scale, communication workload, and software or hardware system architecture, clearly distinguishing between application and communication library usage at a per-process level

Programming Abstractions for Data Locality

The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, we are rapidly moving to an era that computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow to describe how to decompose and how to layout data in the memory.Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy to enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal