66 research outputs found
A survey of techniques for reducing interference in real-time applications on multicore platforms
This survey reviews the scientific literature on techniques for reducing interference in real-time multicore systems, focusing on the approaches proposed between 2015 and 2020. It also presents proposals that use interference reduction techniques without considering the predictability issue. The survey highlights interference sources and categorizes proposals from the perspective of the shared resource. It covers techniques for reducing contentions in main memory, cache memory, a memory bus, and the integration of interference effects into schedulability analysis. Every section contains an overview of each proposal and an assessment of its advantages and disadvantages.This work was supported in part by the Comunidad de Madrid Government "Nuevas TĂ©cnicas de Desarrollo de Software de Tiempo Real Embarcado Para Plataformas. MPSoC de PrĂłxima GeneraciĂłn" under Grant IND2019/TIC-17261
Programming Abstractions for Data Locality
The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, we are rapidly moving to an era that computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor their applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality. Unfortunately current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on the hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant to each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data centric and allow to describe how to decompose and how to layout data in the memory.Fortunately, there are many emerging concepts such as constructs for tiling, data layout, array views, task and thread affinity, and topology aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy to enable us to combine the best of these concepts to develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages to achieve this goal
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing
Recommended from our members
Galois : a system for parallel execution of irregular algorithms
textA programming model which allows users to program with high productivity and which produces high performance executions has been a goal for decades. This dissertation makes progress towards this elusive goal by describing the design and implementation of the Galois system, a parallel programming model for shared-memory, multicore machines. Central to the design is the idea that scheduling of a program can be decoupled from the core computational operator and data structures. However, efficient programs often require application-specific scheduling to achieve best performance. To bridge this gap, an extensible and abstract scheduling policy language is proposed, which allows programmers to focus on selecting high-level scheduling policies while delegating the tedious task of implementing the policy to a scheduler synthesizer and runtime system. Implementations of deterministic and prioritized scheduling also are described. An evaluation of a well-studied benchmark suite reveals that factoring programs into operators, schedulers and data structures can produce significant performance improvements over unfactored approaches. Comparison of the Galois system with existing programming models for graph analytics shows significant performance improvements, often orders of magnitude more, due to (1) better support for the restrictive programming models of existing systems and (2) better support for more sophisticated algorithms and scheduling, which cannot be expressed in other systems.Computer Science
High-Performance Communication Primitives and Data Structures on Message-Passing Manycores:Broadcast and Map
The constant increase in single core frequency reached a plateau during recent years since the produced heat inside the chip cannot be cooled down by existing technologies anymore. An alternative to harvest more computational power per die is to fabricate more number of cores into a single chip. Therefore manycore chips with more than thousand cores are expected by the end of the decade. These environments provide a high level of parallel processing power while their energy consumption is considerably lower than their multi-chip counterparts. Although shared-memory programming is the classical paradigm to program these environments, there are numerous claims that taking into account the full life cycle of software, the message-passing programming model have numerous advantages. The direct architectural consequence of applying a message-passing programming model is to support message passing between the processing entities directly in the hardware. Therefore manycore architectures with hardware support for message passing are becoming more and more visible. These platforms can be seen in two ways: (i) as a High Performance Computing (HPC) cluster programmed by highly trained scientists using Message Passing Interface (MPI) libraries; or (ii) as a mainstream computing platform requiring a global operating system to abstract away the architectural complexities from the ordinary programmer. In the first view, performance of communication primitives is an important bottleneck for MPI applications. In the second view, kernel data structures have been shown to be a limiting factor. In this thesis (i) we overview existing state-of-the-art techniques to circumvent the mentioned bottlenecks; and (ii) we study high-performance broadcast communication primitive and map data structure on modern manycore architectures, with message-passing support in hardware, in two different chapters respectively. In one chapter, we study how to make use of the hardware features to implement an efficient broadcast primitive. We consider the Intel Single-chip Cloud Computer (SCC) as our target platform which offers the ability to move data between on-chip Message Passing Buffers (MPB) using Remote Memory Access (RMA). We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Experimental results show that OC-Bcast attains considerably better performance in terms of latency and throughput compared to state-of-the-art solutions. This performance improvement highlights the benefits of exploiting hardware features of the target platform: Our broadcast algorithm takes direct advantage of RMA, unlike the other broadcast solutions which are based on a higher-level send/receive interface. In the other chapter, we study the implementation of high-throughput concurrent maps in message-passing manycores. Partitioning and replication are the two approaches to achieve high throughput in a message-passing system. This chapter presents and compares different strongly-consistent map algorithms based on partitioning and replication. To assess the performance of these algorithms independently of architecture-specific features, we propose a communication model of message-passing manycores to express the throughput of each algorithm. The model is validated through experiments on a 36-core TILE-Gx8036 processor. Evaluations show that replication outperforms partitioning only in a narrow domain
Scalability in the Presence of Variability
Supercomputers are used to solve some of the world’s most computationally demanding
problems. Exascale systems, to be comprised of over one million cores and capable of 10^18
floating point operations per second, will probably exist by the early 2020s, and will provide
unprecedented computational power for parallel computing workloads. Unfortunately,
while these machines hold tremendous promise and opportunity for applications in High
Performance Computing (HPC), graph processing, and machine learning, it will be a major
challenge to fully realize their potential, because to do so requires balanced execution across
the entire system and its millions of processing elements. When different processors take different
amounts of time to perform the same amount of work, performance imbalance arises,
large portions of the system sit idle, and time and energy are wasted. Larger systems incorporate
more processors and thus greater opportunity for imbalance to arise, as well as larger
performance/energy penalties when it does. This phenomenon is referred to as performance
variability and is the focus of this dissertation.
In this dissertation, we explain how to design system software to mitigate variability
on large scale parallel machines. Our approaches span (1) the design, implementation, and
evaluation of a new high performance operating system to reduce some classes of performance
variability, (2) a new performance evaluation framework to holistically characterize
key features of variability on new and emerging architectures, and (3) a distributed modeling
framework that derives predictions of how and where imbalance is manifesting in order to
drive reactive operations such as load balancing and speed scaling. Collectively, these efforts
provide a holistic set of tools to promote scalability through the mitigation of variability
Resource-aware Programming in a High-level Language - Improved performance with manageable effort on clustered MPSoCs
Bis 2001 bedeutete Moores und Dennards Gesetz eine Verdoppelung der AusfĂĽhrungszeit alle 18 Monate durch verbesserte CPUs.
Heute ist Nebenläufigkeit das dominante Mittel zur Beschleunigung von Supercomputern bis zu mobilen Geräten.
Allerdings behindern neuere Phänomene wie "Dark Silicon" zunehmend eine weitere Beschleunigung durch Hardware.
Um weitere Beschleunigung zu erreichen muss sich auch die SoftÂware mehr ihrer Hardware Resourcen gewahr werden.
Verbunden mit diesem Phänomen ist eine immer heterogenere Hardware.
Supercomputer integrieren Beschleuniger wie GPUs.
Mobile SoCs (bspw. Smartphones) integrieren immer mehr Fähigkeiten.
Spezialhardware auszunutzen ist eine bekannte Methode, um den Energieverbrauch zu senken, was ein weiterer wichtiger Aspekt ist, welcher mit der reinen Geschwindigkeit abgewogen werde muss.
Zum Beispiel werden Supercomputer auch nach "Performance pro Watt" bewertet.
Zur Zeit sind systemnahe low-level Programmierer es gewohnt über Hardware nachzudenken, während der gemeine high-level Programmierer es vorzieht von der Plattform möglichst zu abstrahieren (bspw. Cloud).
"High-level" bedeutet nicht, dass Hardware irrelevant ist, sondern dass sie abstrahiert werden kann.
Falls Sie eine Java-Anwendung fĂĽr Android entwickeln, kann der Akku ein wichtiger Aspekt sein.
Irgendwann mĂĽssen aber auch Hochsprachen resourcengewahr werden, um Geschwindigkeit oder Energieverbrauch zu verbessern.
Innerhalb des Transregio "Invasive Computing" habe ich an diesen Problemen gearbeitet.
In meiner Dissertation stelle ich ein Framework vor, mit dem man Hochsprachenanwendungen resourcengewahr machen kann, um so die Leistung zu verbessern.
Das könnte beispielsweise erhöhte Effizienz oder schnellerer Ausführung für das System als Ganzes bringen.
Ein Kerngedanke dabei ist, dass Anwendungen sich nicht selbst optimieren.
Stattdessen geben sie alle Informationen an das Betriebssystem.
Das Betriebssystem hat eine globale Sicht und trifft Entscheidungen ĂĽber die Resourcen.
Diesen Prozess nennen wir "Invasion".
Die Aufgabe der Anwendung ist es, sich an diese Entscheidungen anzupassen, aber nicht selbst welche zu fällen.
Die Herausforderung besteht darin eine Sprache zu definieren, mit der Anwendungen Resourcenbedingungen und Leistungsinformationen kommunizieren.
So eine Sprache muss ausdrucksstark genug fĂĽr komplexe Informationen, erweiterbar fĂĽr neue Resourcentypen, und angenehm fĂĽr den Programmierer sein.
Die zentralen Beiträge dieser Dissertation sind:
Ein theoretisches Modell der Resourcen-Verwaltung, um die Essenz des resourcengewahren Frameworks zu beschreiben,
die Korrektheit der Entscheidungen des Betriebssystems bezĂĽglich der Bedingungen einer Anwendung zu begrĂĽnden
und zum Beweis meiner Thesen von Effizienz und Beschleunigung in der Theorie.
Ein Framework und eine Ăśbersetzungspfad resourcengewahrer Programmierung fĂĽr die Hochsprache X10.
Zur Bewertung des Ansatzes haben wir Anwendungen aus dem High Performance Computing implementiert.
Eine Beschleunigung von 5x konnte gemessen werden.
Ein Speicherkonsistenzmodell fĂĽr die X10 Programmiersprache, da dies ein notwendiger Schritt zu einer formalen Semantik ist, die das theoretische Modell und die konkrete Implementierung verknĂĽpft.
Zusammengefasst zeige ich, dass resourcengewahre Programmierung in Hoch\-sprachen auf zukĂĽnftigen Architekturen mit vielen Kernen mit vertretbarem Aufwand machbar ist und die Leistung verbessert
Specialization without complexity in heterogeneous memory systems
The end of Dennard scaling and Moore's law has motivated a rise in the use of parallelism and hardware specialization in computer system design. Across all compute domains, applications have increasingly relied on specialized devices such as GPUs, DSPs, FPGAs, etc., to execute tasks faster and more efficiently, but interfacing these diverse devices within a heterogeneous system remains an important challenge. Early heterogeneous systems were loosely coupled and lacked a shared coherent memory interface, so specialization was reserved for highly regular code patterns with coarse-grained synchronization requirements. More recently, the need to accelerate applications with more irregular and fine-grained sharing patterns has led to significant research into closer integration of specialized devices.
A single global address space enables improved programmability, communication efficiency, data reuse, and load balancing for emerging heterogeneous applications. Consequently, there have been many attempts to integrate specialized devices and their caches into a single coherent memory hierarchy to improve performance in future systems-on-chip (SoCs). However, coherence is particularly difficult to implement in heterogeneous systems. Differences in parallelism, locality, and synchronization in high-throughput accelerators such as GPUs means that coherence and consistency strategies designed for CPUs are ineffective, and evaluating the performance of alternative strategies is difficult. Recent efforts to implement coherence for such devices involve a simple software-driven coherence strategy combined with complex extensions to a conventional memory consistency model, which guarantees sequential consistency (SC) for programs that are data race-free (DRF). The first extension, scoped synchronization, avoids coherence costs when synchronization is guaranteed to be local, but it requires the use of the heterogeneous race-free (HRF) consistency model, which limits sharing patterns and increases the burden on the programmer. The second extension, relaxed atomics, allows the programmer to avoid costly ordering constraints when they are unnecessary for functionality, but existing consistency models offer complex and often poorly specified semantics when relaxed atomics are used. Once an appropriate coherence and consistency strategy is determined for a device, interfacing it with devices using different strategies poses another critical challenge. Existing integration strategies are incremental, either sacrificing system flexibility or incurring significant added complexity to achieve this goal. A rethinking of heterogeneous coherence and protocol integration from the ground up is needed.
This work lays out a path to implementing flexible and efficient heterogeneous coherence without adding complexity to the consistency model or the system design. To help understand the memory demands of emerging specialized hardware, we first describe a performance analysis tool we developed for highly parallel workloads. Insights from this tool helped guide the development of a collection of coherence and consistency innovations for high-throughput accelerators. On the coherence side, we describe two innovations, DeNovo for GPUs and heterogeneous lazy release consistency (hLRC), which demonstrate that scoped synchronization is not necessary for cache efficiency in high-throughput devices. On the consistency side, this work describes the DRFrlx consistency model, which formalizes safe use cases of atomic relaxation. Again, we offer these benefits while retaining a simple SC-centric DRF consistency model. Finally, to address the challenge of integrating diverse coherence strategies, we present the Spandex coherence interface. Spandex can flexibly and simply integrate devices with a broad range of memory demands in an SoC, and we show how this flexibility enables new performance optimizations that can take advantage of hints about the expected memory demands of an application. Together, these innovations establish a framework for integrating future SoCs that can dynamically adapt to serve the diverse memory demands of future accelerators without incurring complexity for hardware or software designers
GUMSMP: a scalable parallel Haskell implementation
The most widely available high performance platforms today are hierarchical,
with shared memory leaves, e.g. clusters of multi-cores, or NUMA with multiple
regions. The Glasgow Haskell Compiler (GHC) provides a number of parallel
Haskell implementations targeting different parallel architectures. In particular,
GHC-SMP supports shared memory architectures, and GHC-GUM supports
distributed memory machines. Both implementations use different, but related,
runtime system (RTS) mechanisms and achieve good performance. A specialised
RTS for the ubiquitous hierarchical architectures is lacking.
This thesis presents the design, implementation, and evaluation of a new
parallel Haskell RTS, GUMSMP, that combines shared and distributed memory
mechanisms to exploit hierarchical architectures more effectively. The design
evaluates a variety of design choices and aims to efficiently combine scalable
distributed memory parallelism, using a virtual shared heap over a hierarchical
architecture, with low-overhead shared memory parallelism on shared memory
nodes. Key design objectives in realising this system are to prefer local work,
and to exploit mostly passive load distribution with pre-fetching.
Systematic performance evaluation shows that the automatic hierarchical load
distribution policies must be carefully tuned to obtain good performance. We
investigate the impact of several policies including work pre-fetching, favouring
inter-node work distribution, and spark segregation with different export and
select policies. We present the performance results for GUMSMP, demonstrating
good scalability for a set of benchmarks on up to 300 cores. Moreover, our policies
provide performance improvements of up to a factor of 1.5 compared to GHC-
GUM.
The thesis provides a performance evaluation of distributed and shared heap
implementations of parallel Haskell on a state-of-the-art physical shared memory
NUMA machine. The evaluation exposes bottlenecks in memory management,
which limit scalability beyond 25 cores. We demonstrate that GUMSMP, that
combines both distributed and shared heap abstractions, consistently outper-
forms the shared memory GHC-SMP on seven benchmarks by a factor of 3.3
on average. Specifically, we show that the best results are obtained when shar-
ing memory only within a single NUMA region, and using distributed memory
system abstractions across the regions
- …