447 research outputs found

    Managing the Topology of Heterogeneous Cluster Nodes with Hardware Locality (hwloc)

    Get PDF
    International audienceModern computing platforms are increasingly complex, with multiple cores, shared caches, and NUMA architectures. Parallel applications developers have to take locality into account before they can expect good efficiency on these platforms. Thus there is a strong need for a portable tool gathering and exposing this information. The Hardware Locality project (hwloc) offers a tree representation of the hardware based on the inclusion and localities of the CPU and memory resources. It is already widely used for affinity-based task placement in high performance computing. In this article we present how hwloc is extended to describe more than computing and memory resources. Indeed, I/O device locality is becoming another important aspect of locality since high performance GPUs, network or InfiniBand interfaces possess privileged access to some of the cores and memory banks. hwloc integrates this knowledge into its topology representation and offers an interoperability API to extend existing libraries such as CUDA with locality information. We also describe how hwloc now helps process managers and batch schedulers to deal with the topology of multiple cluster nodes, together with compression for better scalability up to thousands of nodes

    Productive and efficient computational science through domain-specific abstractions

    Get PDF
    In an ideal world, scientific applications are computationally efficient, maintainable and composable and allow scientists to work very productively. We argue that these goals are achievable for a specific application field by choosing suitable domain-specific abstractions that encapsulate domain knowledge with a high degree of expressiveness. This thesis demonstrates the design and composition of domain-specific abstractions by abstracting the stages a scientist goes through in formulating a problem of numerically solving a partial differential equation. Domain knowledge is used to transform this problem into a different, lower level representation and decompose it into parts which can be solved using existing tools. A system for the portable solution of partial differential equations using the finite element method on unstructured meshes is formulated, in which contributions from different scientific communities are composed to solve sophisticated problems. The concrete implementations of these domain-specific abstractions are Firedrake and PyOP2. Firedrake allows scientists to describe variational forms and discretisations for linear and non-linear finite element problems symbolically, in a notation very close to their mathematical models. PyOP2 abstracts the performance-portable parallel execution of local computations over the mesh on a range of hardware architectures, targeting multi-core CPUs, GPUs and accelerators. Thereby, a separation of concerns is achieved, in which Firedrake encapsulates domain knowledge about the finite element method separately from its efficient parallel execution in PyOP2, which in turn is completely agnostic to the higher abstraction layer. As a consequence of the composability of those abstractions, optimised implementations for different hardware architectures can be automatically generated without any changes to a single high-level source. Performance matches or exceeds what is realistically attainable by hand-written code. Firedrake and PyOP2 are combined to form a tool chain that is demonstrated to be competitive with or faster than available alternatives on a wide range of different finite element problems.Open Acces

    Programming Models\u27 Support for Heterogeneous Architecture

    Get PDF
    Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Heterogeneous systems equipped with accelerators such as GPUs have become the most prominent components of High Performance Computing (HPC) systems. Even at the node level the significant heterogeneity of CPU and GPU, i.e. hardware and memory space differences, leads to challenges for fully exploiting such complex architectures. Extending outside the node scope, only escalate such challenges. Conventional programming models such as data- ow and message passing have been widely adopted in HPC communities. When moving towards heterogeneous systems, the lack of GPU integration causes such programming models to struggle in handling the heterogeneity of different computing units, leading to sub-optimal performance and drastic decrease in developer productivity. To bridge the gap between underlying heterogeneous architectures and current programming paradigms, we propose to extend such programming paradigms with architecture awareness optimization. Two programming models are used to demonstrate the impact of heterogeneous architecture awareness. The PaRSEC task-based runtime, an adopter of the data- ow model, provides opportunities for overlapping communications with computations and minimizing data movements, as well as dynamically adapting the work granularity to the capability of the hardware. To fulfill the demand of an efficient and portable Message Passing Interface (MPI) implementation to communicate GPU data, a GPU-aware design is presented based on the Open MPI infrastructure supporting efficient point-to-point and collective communications of GPU-residential data, for both contiguous and non-contiguous memory layouts, by leveraging GPU network topology and hardware capabilities such as GPUDirect. The tight integration of GPU support in a widely used programming environment, free the developers from manually move data into/out of host memory before/after relying on MPI routines for communications, allowing them to focus instead on algorithmic optimizations. Experimental results have confirmed that supported by such a tight and transparent integration, conventional programming models can once again take advantage of the state-of-the-art hardware and exhibit performance at the levels expected by the underlying hardware capabilities

    Programming Abstractions for Data Locality

    Get PDF
    The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, we are rapidly moving to an era that computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow to describe how to decompose and how to layout data in the memory.Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy to enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal

    Towards Fast and Portable Microkernels

    Get PDF
    • …
    corecore