
    Performance analysis of broadcasting algorithms on the Intel Single-Chip Cloud Computer


    Efficient Communication and Synchronization on Manycore Processors

    The increased number of cores integrated on a chip has brought about a number of challenges. Concerns about the scalability of cache coherence protocols have urged both researchers and practitioners to explore alternative programming models, where cache coherence is not a given. Message passing, traditionally used in distributed systems, has surfaced as an appealing alternative to shared memory, commonly used in multiprocessor systems. In this thesis, we study how basic communication and synchronization primitives on manycore processors can be improved, with an emphasis on taking advantage of message passing. We do this in two different contexts: (i) message passing is the only means of communication; and (ii) it coexists with traditional cache-coherent shared memory.
    In the first part of the thesis, we analytically and experimentally study collective communication on a message-passing manycore processor. First, we devise broadcast algorithms for the Intel SCC, an experimental manycore platform without coherent caches. Our ideas are captured by OC-Bcast (on-chip broadcast), a tree-based broadcast algorithm. Two versions of OC-Bcast are presented: one for synchronous communication, suitable for use in high-performance libraries implementing the Message Passing Interface (MPI), and one for asynchronous communication, intended for distributed algorithms and general-purpose software. Both flavors of OC-Bcast are based on one-sided communication and significantly outperform (by up to 3x) state-of-the-art two-sided algorithms. Next, we develop an analytical communication model for the SCC. By expressing the latency and throughput of different broadcast algorithms through this model, we show that the advantage of OC-Bcast comes from greatly reducing the number of off-chip memory accesses on the critical path.
    The second part of the thesis focuses on lock-based synchronization. We start by introducing the concept of hybrid mutual exclusion algorithms, which rely on both cache-coherent shared memory and message passing. The hybrid algorithms we present, HybLock and HybComb, significantly outperform (by up to 4x) their shared-memory-only counterparts when used to implement concurrent counters, stacks and queues on a hybrid Tilera TILE-Gx processor. The advantage of our hybrid algorithms comes from the fact that their most critical parts rely on message passing, thereby avoiding the overhead of the cache coherence protocol. Still, we take advantage of shared memory, as shared state makes the implementation of certain mechanisms much more straightforward. Finally, we apply these insights to processors without hardware support for message passing: on two classic x86 processors from Intel and AMD, we devise cache-aware optimizations that improve the performance of executing contended critical sections by as much as 6x.
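
    To make the delegation idea behind such hybrid locks more concrete, here is a minimal, hypothetical C11 sketch (not the thesis' HybLock/HybComb code): a dedicated server thread executes the critical section on behalf of client threads, so contended updates travel as mailbox messages instead of bouncing a lock's cache line between cores. The thread counts, mailbox layout and names are illustrative assumptions only.

        /* Hypothetical delegation sketch; compile with e.g. cc -std=c11 -pthread. */
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define CLIENTS 4
        #define OPS_PER_CLIENT 100000

        enum { EMPTY, REQUEST };              /* per-client mailbox states      */
        static _Atomic int mailbox[CLIENTS];  /* one single-writer slot each    */
        static long counter;                  /* shared state, server-only      */
        static _Atomic int done;

        static void *server(void *arg) {      /* executes all critical sections */
            (void)arg;
            while (!atomic_load_explicit(&done, memory_order_acquire)) {
                for (int c = 0; c < CLIENTS; c++) {
                    if (atomic_load_explicit(&mailbox[c], memory_order_acquire) == REQUEST) {
                        counter++;            /* the critical section itself    */
                        atomic_store_explicit(&mailbox[c], EMPTY, memory_order_release);
                    }
                }
            }
            return NULL;
        }

        static void *client(void *arg) {
            int id = (int)(long)arg;
            for (int i = 0; i < OPS_PER_CLIENT; i++) {
                atomic_store_explicit(&mailbox[id], REQUEST, memory_order_release);
                while (atomic_load_explicit(&mailbox[id], memory_order_acquire) != EMPTY)
                    ;                         /* spin locally until served      */
            }
            return NULL;
        }

        int main(void) {
            pthread_t s, t[CLIENTS];
            pthread_create(&s, NULL, server, NULL);
            for (long c = 0; c < CLIENTS; c++)
                pthread_create(&t[c], NULL, client, (void *)c);
            for (int c = 0; c < CLIENTS; c++)
                pthread_join(t[c], NULL);
            atomic_store_explicit(&done, 1, memory_order_release);
            pthread_join(s, NULL);
            printf("counter = %ld (expected %d)\n", counter, CLIENTS * OPS_PER_CLIENT);
            return 0;
        }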

    High-Performance Communication Primitives and Data Structures on Message-Passing Manycores: Broadcast and Map

    The steady increase in single-core frequency has reached a plateau in recent years, since the heat produced inside the chip can no longer be dissipated by existing cooling technologies. An alternative way to harvest more computational power per die is to fabricate more cores on a single chip. Manycore chips with more than a thousand cores are therefore expected by the end of the decade. These environments provide a high level of parallel processing power, while their energy consumption is considerably lower than that of their multi-chip counterparts. Although shared-memory programming is the classical paradigm for these environments, it has often been argued that, taking the full life cycle of software into account, the message-passing programming model has numerous advantages. The direct architectural consequence of adopting a message-passing programming model is to support message passing between processing entities directly in hardware, and manycore architectures with such hardware support are becoming more and more visible. These platforms can be seen in two ways: (i) as a High Performance Computing (HPC) cluster programmed by highly trained scientists using Message Passing Interface (MPI) libraries; or (ii) as a mainstream computing platform requiring a global operating system to abstract away the architectural complexities from the ordinary programmer. In the first view, the performance of communication primitives is an important bottleneck for MPI applications. In the second view, kernel data structures have been shown to be a limiting factor. In this thesis, (i) we survey existing state-of-the-art techniques for circumventing these bottlenecks, and (ii) we study, in two separate chapters, a high-performance broadcast communication primitive and a map data structure on modern manycore architectures with hardware support for message passing.
    In one chapter, we study how to make use of hardware features to implement an efficient broadcast primitive. We consider the Intel Single-chip Cloud Computer (SCC) as our target platform, which offers the ability to move data between on-chip Message Passing Buffers (MPB) using Remote Memory Access (RMA). We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Experimental results show that OC-Bcast attains considerably better performance in terms of latency and throughput than state-of-the-art solutions. This performance improvement highlights the benefits of exploiting hardware features of the target platform: our broadcast algorithm takes direct advantage of RMA, unlike the other broadcast solutions, which are based on a higher-level send/receive interface.
    In the other chapter, we study the implementation of high-throughput concurrent maps on message-passing manycores. Partitioning and replication are the two approaches to achieving high throughput in a message-passing system. This chapter presents and compares different strongly consistent map algorithms based on partitioning and replication. To assess the performance of these algorithms independently of architecture-specific features, we propose a communication model of message-passing manycores to express the throughput of each algorithm. The model is validated through experiments on a 36-core TILE-Gx8036 processor. The evaluation shows that replication outperforms partitioning only in a narrow domain.
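
    As a rough illustration of the partitioning approach to strongly consistent maps, the following hypothetical, single-process C sketch (not the thesis code; names such as owner_of and shard_put are invented) assigns each key to exactly one owner partition, so all operations on a key are serialized at its owner without any locking of the map itself. In a real message-passing manycore, the direct call to the owner's shard would instead be a request/reply message to the owning core.

        #include <stdio.h>

        #define CORES  4          /* number of partitions ("cores") */
        #define SLOTS  64         /* bucket slots per partition     */

        struct shard { long keys[SLOTS]; long vals[SLOTS]; int used[SLOTS]; };
        static struct shard shards[CORES];

        /* Partitioning rule: every key has exactly one owner core. */
        static int owner_of(long key) { return (int)(key % CORES); }

        /* Executed "at" the owner core; here we simply call it directly. */
        static void shard_put(struct shard *s, long key, long val) {
            int h = (int)(key % SLOTS);
            for (int i = 0; i < SLOTS; i++) {          /* linear probing */
                int j = (h + i) % SLOTS;
                if (!s->used[j] || s->keys[j] == key) {
                    s->used[j] = 1; s->keys[j] = key; s->vals[j] = val; return;
                }
            }
        }

        static int shard_get(const struct shard *s, long key, long *val) {
            int h = (int)(key % SLOTS);
            for (int i = 0; i < SLOTS; i++) {
                int j = (h + i) % SLOTS;
                if (!s->used[j]) return 0;
                if (s->keys[j] == key) { *val = s->vals[j]; return 1; }
            }
            return 0;
        }

        int main(void) {
            for (long k = 0; k < 20; k++)
                shard_put(&shards[owner_of(k)], k, k * 10);
            long v;
            if (shard_get(&shards[owner_of(7)], 7, &v))
                printf("key 7 -> %ld (owner core %d)\n", v, owner_of(7));
            return 0;
        }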

    High-performance RMA-based Broadcast on the Intel SCC

    Many-core chips with more than 1000 cores are expected by the end of the decade. To overcome scalability issues related to cache coherence at such a scale, one of the main research directions is to leverage the message-passing programming model. The Intel Single-Chip Cloud Computer (SCC) is a prototype of a message-passing many-core chip. It offers the ability to move data between on-chip Message Passing Buffers (MPB) using Remote Memory Access (RMA). The performance of message-passing applications is directly affected by the efficiency of collective operations such as broadcast. In this paper, we study how to make use of the MPBs to implement an efficient broadcast algorithm for the SCC. We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Using a LogP-based model, we present an analytical evaluation that compares our algorithm with the state-of-the-art broadcast algorithms implemented for the SCC. As predicted by the model, experimental results show that OC-Bcast attains almost three times better throughput and improves latency by at least 27%. Furthermore, the analytical evaluation highlights the benefits of our approach: OC-Bcast takes direct advantage of RMA, unlike the other considered broadcast algorithms, which are based on a higher-level send/receive interface. This leads us to the conclusion that RMA-based collective operations are needed to take full advantage of the hardware features of future message-passing many-core architectures.
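
    As a hedged illustration of the tree structure such an algorithm relies on (a sketch only, not the paper's OC-Bcast implementation), the following C snippet computes each core's parent and children in a k-ary broadcast tree, assuming cores are renumbered so that the broadcast root is rank 0; the arity K and core count P used here are arbitrary example values.

        #include <stdio.h>

        #define K 3   /* tree arity; a real implementation would tune this per platform */

        /* Rank 0 is the (renumbered) broadcast root; every other rank has one parent. */
        static int parent_of(int rank)      { return rank == 0 ? -1 : (rank - 1) / K; }
        static int first_child_of(int rank) { return rank * K + 1; }

        int main(void) {
            int P = 16;                               /* number of participating cores */
            for (int r = 0; r < P; r++) {
                printf("rank %2d: parent %2d, children:", r, parent_of(r));
                for (int c = first_child_of(r); c < first_child_of(r) + K && c < P; c++)
                    printf(" %d", c);
                printf("\n");
            }
            return 0;
        }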

    Efficient Programming Model for OpenMP on a Cluster-Based Many-Core System

    As the complexity of systems-on-chip (SoCs) continues to increase, it is no longer possible to ignore the challenges caused by the convergence of software and hardware development. This includes dealing with hierarchical designs, in which several cores are grouped into clusters or tiles, to ensure low-latency, high-bandwidth local communication by relying on fast local memories. From a programmer's perspective, it is desirable to make use of these peculiarities of the hardware, which must be clearly and carefully taken into account when designing the support for high-level parallel programming models. This dissertation overcomes many scalability bottlenecks in cluster-based many-core systems and introduces the OpenMP programming model as a means of simplifying application development. OpenMP abstracts away the programmer's view of the hardware by providing directives that decompose loops in sequential programs and turn them into parallel programs.
    In this work, the full OpenMP model is implemented on a specific instance of a cluster-based many-core system: the Intel Single-chip Cloud Computer (SCC). The thesis presents a lightweight and highly optimized runtime layer for OpenMP execution, together with a memory model; on top of this runtime layer, parallel code is generated automatically by a native back-end compiler (GCC 4.6) linked against the runtime library. The dissertation describes an efficient design approach to the OpenMP programming model, using the Intel SCC as an example of a cluster-based system. On a non-cache-coherent system, the SCC OpenMP runtime library is designed to cope with three main challenges: 1. executing unmodified legacy OpenMP programs on such a system; 2. porting the OpenMP memory model to the SCC; and 3. synchronizing the parallel threads, which accounts for a sizeable fraction of an application's execution time. Furthermore, the effectiveness of OpenMP is demonstrated on a set of widely used kernels and real-world applications. An extensive set of experiments shows how this model achieves significant parallel speedups, of up to 48x, in several applications.
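
    For reference, a minimal OpenMP program of the kind such a runtime has to support is shown below: only the pragma is written by the programmer, while thread creation, loop scheduling and the reduction are handled by the runtime (on the SCC, by the non-cache-coherent runtime layer described above). The loop body and problem size are arbitrary example values; compile with a host compiler such as gcc -fopenmp.

        #include <omp.h>
        #include <stdio.h>

        int main(void) {
            const int N = 1000000;
            double sum = 0.0;

            /* The runtime splits the iteration space across threads and
             * combines the per-thread partial sums at the end. */
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < N; i++)
                sum += 1.0 / (i + 1);

            printf("threads: %d, sum: %f\n", omp_get_max_threads(), sum);
            return 0;
        }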