16 research outputs found
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK pose additional design constraints due
to very large message communication of GPU buffers during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK.Comment: 8 pages, 3 figure
Efficient Broadcast for Multicast-Capable Interconnection Networks
The broadcast function MPI_Bcast() from the
MPI-1.1 standard is one of the most heavily
used collective operations for the message
passing programming paradigm.
This diploma thesis makes use of a feature called
"Multicast", which is supported by several
network technologies (like Ethernet or
InfiniBand), to create an efficient MPI_Bcast()
implementation, especially for large communicators
and small-sized messages.
A preceding analysis of existing real-world
applications leads to an algorithm which does not
only perform well for synthetical benchmarks
but also even better for a wide class of
parallel applications. The finally derived
broadcast has been implemented for the
open source MPI library "Open MPI" using
IP multicast.
The achieved results prove that
the new broadcast is usually always better
than existing point-to-point implementations,
as soon as the number of MPI processes exceeds the
8 node boundary. The performance gain reaches
a factor of 4.9 on 342 nodes, because the
new algorithm scales practically independently
of the number of involved processes.Die Broadcastfunktion MPI_Bcast() aus dem MPI-1.1
Standard ist eine der meistgenutzten kollektiven
Kommunikationsoperationen des nachrichtenbasierten
Programmierparadigmas.
Diese Diplomarbeit nutzt die Multicastfähigkeit,
die von mehreren Netzwerktechnologien (wie Ethernet
oder InfiniBand) bereitgestellt wird, um eine
effiziente MPI_Bcast() Implementation zu erschaffen,
insbesondere für große Kommunikatoren und kleinere
Nachrichtengrößen.
Eine vorhergehende Analyse von existierenden
parallelen Anwendungen führte dazu, dass der neue
Algorithmus nicht nur bei synthetischen Benchmarks
gut abschneidet, sondern sein Potential bei echten
Anwendungen noch besser entfalten kann. Der
letztendlich daraus entstandene Broadcast wurde
für die Open-Source MPI Bibliothek "Open MPI"
entwickelt und basiert auf IP Multicast.
Die erreichten Ergebnisse belegen, dass der neue
Broadcast üblicherweise immer besser als jegliche
Punkt-zu-Punkt Implementierungen ist, sobald die
Anzahl von MPI Prozessen die Grenze von 8 Knoten
überschreitet. Der Geschwindigkeitszuwachs
erreicht einen Faktor von 4,9 bei 342 Knoten,
da der neue Algorithmus praktisch unabhängig
von der Knotenzahl skaliert
Optimizing Communication for Massively Parallel Processing
The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, it is going to be a significant burden for the programmer to manually scale his applications. This task of scaling involves addressing issues like load-imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications to easily scale on a large number of processors. We also present automatic runtime techniques to relieve the programmer from the burden of optimizing communication in his applications.
This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true and processor virtualization has a natural advantage on such interconnects.
The communication optimizations we present in this thesis, are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers. So, improving their performance would be of great value. We have successfully scaled NAMD to 1TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis.
We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors all-to-all communication can take several milli-seconds to finish. With synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework, to let the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication.
The communication optimization framework presented in this thesis, has been designed to optimize communication in the context of processor virtualization and dynamic migrating objects. We present the streaming strategy to optimize fine grained object-to-object communication.
In this thesis, we motivate the need for hardware collectives, as processor based collectives can be delayed by intermediate that processors busy with computation. We explore a next generation interconnect that supports collectives in the switching hardware. We show the performance gains of hardware collectives through synthetic benchmarks
Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters
Multicore or many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies not only exist in the inter-node layer, i.e., hierarchical networks, but also exist in internals of multicore compute nodes, e.g., Non Uniform Memory Accesses (NUMA), network-style interconnect, and memory and shared cache hierarchies.
Message Passing Interface (MPI), the most widely adopted in the HPC communities, suffers from decreased performance and portability due to increased hardware complexity of multiple levels. We identified three critical issues specific to collective communication: The first problem arises from the gap between logical collective topologies and underlying hardware topologies; Second, current MPI communications lack efficient shared memory message delivering approaches; Last, on distributed memory machines, like multicore clusters, a single approach cannot encompass the extreme variations not only in the bandwidth and latency capabilities, but also in features such as the aptitude to operate multiple concurrent copies simultaneously.
To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework to integrate the knowledge of hardware distance into collective algorithms in order to dynamically reshape the communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes in a multi-level hierarchy, mapping on the hardware characteristics. Meanwhile, we took advantage of the kernel-assisted one-sided single-copy approach (KNEM) as the default shared memory delivering method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/not-root processes to evenly distribute copy workloads among available cores. Finally, on distributed memory machines, we developed a technique to compose multi-layered collective algorithms together to express a multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlaps between inter- and intra-node communication.
Experimental results have confirmed that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, not only do MPI collectives reach the potential maximum performance on a wide variation of platforms, but they also deliver a level of performance immune to modifications of the underlying process-core binding
Light-Weight Remote Communication for High-Performance Cloud Networks
Während der letzten 10 Jahre gewann das Cloud Computing immer weiter an Bedeutung. Um kosten zu sparen installieren immer mehr Anwender ihre Anwendungen in der Cloud, statt eigene Hardware zu kaufen und zu betreiben. Als Reaktion entstanden große Rechenzentren, die ihren Kunden Rechnerkapazität zum Betreiben eigener Anwendungen zu günstigen Preisen anbieten. Diese Rechenzentren verwenden momentan gewöhnliche Rechnerhardware, die zwar leistungsstark ist, aber hohe Anschaffungs- und Stromkosten verursacht. Aus diesem Grund werden momentan neue Hardwarearchitekturen mit schwächeren aber energieeffizienteren CPUs entwickelt. Wir glauben, dass in zukünftiger Cloudhardware außerdem Netzwerkhardware mit Zusatzfunktionen wie user-level I/O oder remote DMA zum Einsatz kommt, um die CPUs zu entlasten.
Aktuelle Cloud-Plattformen setzen meist bekannte Betriebssysteme wie Linux oder Microsoft Windows ein, um Kompatibilität mit existierender Software zu gewährleisten. Diese Betriebssysteme beinhalten oft keine Unterstützung für die speziellen Funktionen zukünftiger Netzwerkhardware. Stattdessen verwenden sie traditionell software-basierte Netzwerkstacks, die auf TCP/IP und dem Berkeley-Socket-Interface basieren. Besonders das Socket-Interface ist mit Funktionen wie remote DMA weitgehend inkompatibel, da seine Semantik auf Datenströmen basiert, während remote DMA-Anfragen sich eher wie in sich abgeschlossene Nachrichten verhalten.
In der vorliegenden Arbeit beschreiben wir LibRIPC, eine leichtgewichtige Kommunikationsbibliothek für Cloud-Anwendungen. LibRIPC verbessert die Leistung zukünftiger Netzwerkhardware signifikant, ohne dabei die von Anwendungen benötigte Flexibilität zu vernachlässigen. Anstatt Sockets bietet LibRIPC eine nachrichtenbasierte Schnittstelle an, zwei Funktionen zum senden von Daten implementiert: Eine Funktion für kurze Nachrichten, die auf niedrige Latenz optimiert ist, sowie eine Funktion für lange Nachrichten, die durch die Nutzung von remote DMA-Funktionalität hohe Datendurchsätze erreicht. Übertragene Daten werden weder beim Senden noch beim Empfangen kopiert, um die Übertragungslatenz zu minimieren. LibRIPC nutzt den vollen Funktionsumfang der Hardware aus, versteckt die Hardwarefunktionen aber gleichzeitig vor der Anwendung, um die Hardwareunabhängigkeit der Anwendung zu gewährleisten. Um Flexibilität zu erreichen verwendet die Bibliothek ein eigenes Adressschema, dass sowohl von der verwendeten Hardware als auch von physischen Maschinen unabhängig ist. Hardwareabhängige Adressen werden dynamisch zur Laufzeit aufgelöst, was starten, stoppen und migrieren von Prozessen zu beliebigen Zeitpunkten erlaubt.
Um unsere Lösung zu Bewerten implementierten wir einen Prototypen auf Basis von InfiniBand. Dieser Prototyp nutzt die Vorteile von InfiniBand, um effiziente Datenübertragungen zu ermöglichen, und vermeidet gleichzeitig die Nachteile von InfiniBand, indem er die Ergebnisse langwieriger Operationen speichert und wiederverwendet. Wir führten Experimente auf Basis dieses Prototypen und des Webservers Jetty durch. Zu diesem Zweck integrierten wir Jetty in das Hadoop map/reduce framework, um realistische Lastbedingungen zu erzeugen. Während dabei die effiziente Integration von LibRIPC und Jetty vergleichsweise einfach war, erwies sich die Integration von LibRIPC und Hadoop als deutlich schwieriger: Um unnötiges Kopieren von Daten zu vermeiden, währen weitgehende Änderungen an der Codebasis von Hadoop erforderlich. Dennoch legen unsere Ergebnisse nahe, dass LibRIPC Datendurchsatz, Latenz und Overhead gegenüber Socketbasierter Kommunikation deutlich verbessert
Accelerating Network Communication and I/O in Scientific High Performance Computing Environments
High performance computing has become one of the major drivers behind technology inventions and science discoveries. Originally driven through the increase of operating frequencies and technology scaling, a recent slowdown in this evolution has led to the development of multi-core architectures, which are supported by accelerator devices such as graphics processing units (GPUs). With the upcoming exascale era, the overall power consumption and the gap between compute capabilities and I/O bandwidth have become major challenges. Nowadays, the system performance is dominated by the time spent in communication and I/O, which highly depends on the capabilities of the network interface. In order to cope with the extreme concurrency and heterogeneity of future systems, the software ecosystem of the interconnect needs to be carefully tuned to excel in reliability, programmability, and usability.
This work identifies and addresses three major gaps in today's interconnect software systems. The I/O gap describes the disparity in operating speeds between the computing capabilities and second storage tiers. The communication gap is introduced through the communication overhead needed to synchronize distributed large-scale applications and the mixed workload. The last gap is the so called concurrency gap, which is introduced through the extreme concurrency and the inflicted learning curve posed to scientific application developers to exploit the hardware capabilities.
The first contribution is the introduction of the network-attached accelerator approach, which moves accelerators into a "stand-alone" cluster connected through the Extoll interconnect. The novel communication architecture enables the direct accelerators communication without any host interactions and an optimal application-to-compute-resources mapping. The effectiveness of this approach is evaluated for two classes of accelerators: Intel Xeon Phi coprocessors and NVIDIA GPUs.
The next contribution comprises the design, implementation, and evaluation of the support of legacy codes and protocols over the Extoll interconnect technology. By providing TCP/IP protocol support over Extoll, it is shown that the performance benefits of the interconnect can be fully leveraged by a broader range of applications, including the seamless support of legacy codes.
The third contribution is twofold. First, a comprehensive analysis of the Lustre networking protocol semantics and interfaces is presented. Afterwards, these insights are utilized to map the LNET protocol semantics onto the Extoll networking technology. The result is a fully functional Lustre network driver for Extoll. An initial performance evaluation demonstrates promising bandwidth and message rate results.
The last contribution comprises the design, implementation, and evaluation of two easy-to-use load balancing frameworks, which transparently distribute the I/O workload across all available storage system components. The solutions maximize the parallelization and throughput of file I/O. The frameworks are evaluated on the Titan supercomputing systems for three I/O interfaces. For example for large-scale application runs, POSIX I/O and MPI-IO can be improved by up to 50% on a per job basis, while HDF5 shows performance improvements of up to 32%
Efficient Communication and Synchronization on Manycore Processors
The increased number of cores integrated on a chip has brought about a number of challenges. Concerns about the scalability of cache coherence protocols have urged both researchers and practitioners to explore alternative programming models, where cache coherence is not a given. Message passing, traditionally used in distributed systems, has surfaced as an appealing alternative to shared memory, commonly used in multiprocessor systems. In this thesis, we study how basic communication and synchronization primitives on manycore processors can be improved, with an accent on taking advantage of message passing. We do this in two different contexts: (i) message passing is the only means of communication and (ii) it coexists with traditional cache-coherent shared memory. In the first part of the thesis, we analytically and experimentally study collective communication on a message-passing manycore processor. First, we devise broadcast algorithms for the Intel SCC, an experimental manycore platform without coherent caches. Our ideas are captured by OC-Bcast (on-chip broadcast), a tree-based broadcast algorithm. Two versions of OC-Bcast are presented: One for synchronous communication, suitable for use in high-performance libraries implementing the Message Passing Interface (MPI), and another for asynchronous communication, for use in distributed algorithms and general-purpose software. Both OC-Bcast flavors are based on one-sided communication and significantly outperform (by up to 3x) state-of-the-art two-sided algorithms. Next, we conceive an analytical communication model for the SCC. By expressing the latency and throughput of different broadcast algorithms through this model, we reveal that the advantage of OC-Bcast comes from greatly reducing the number of off-chip memory accesses on the critical path. The second part of the thesis focuses on lock-based synchronization. We start by introducing the concept of hybrid mutual exclusion algorithms, which rely both on cache-coherent shared memory and message passing. The hybrid algorithms we present, HybLock and HybComb, are shown to significantly outperform (by even 4x) their shared-memory-only counterparts, when used to implement concurrent counters, stacks and queues on a hybrid Tilera TILE-Gx processor. The advantage of our hybrid algorithms comes from the fact that their most critical parts rely on message passing, thereby avoiding the overhead of the cache coherence protocol. Still, we take advantage of shared memory, as shared state makes the implementation of certain mechanisms much more straightforward. Next, we try to profit from these insights even on processors without hardware support for message passing. Taking two classic x86 processors from Intel and AMD, we come up with cache-aware optimizations that improve the performance of executing contended critical sections by as much as 6x
Energy Demand Response for High-Performance Computing Systems
The growing computational demand of scientific applications has greatly motivated the development of large-scale high-performance computing (HPC) systems in the past decade. To accommodate the increasing demand of applications, HPC systems have been going through dramatic architectural changes (e.g., introduction of many-core and multi-core systems, rapid growth of complex interconnection network for efficient communication between thousands of nodes), as well as significant increase in size (e.g., modern supercomputers consist of hundreds of thousands of nodes). With such changes in architecture and size, the energy consumption by these systems has increased significantly. With the advent of exascale supercomputers in the next few years, power consumption of the HPC systems will surely increase; some systems may even consume hundreds of megawatts of electricity. Demand response programs are designed to help the energy service providers to stabilize the power system by reducing the energy consumption of participating systems during the time periods of high demand power usage or temporary shortage in power supply.
This dissertation focuses on developing energy-efficient demand-response models and algorithms to enable HPC system\u27s demand response participation. In the first part, we present interconnection network models for performance prediction of large-scale HPC applications. They are based on interconnected topologies widely used in HPC systems: dragonfly, torus, and fat-tree. Our interconnect models are fully integrated with an implementation of message-passing interface (MPI) that can mimic most of its functions with packet-level accuracy. Extensive experiments show that our integrated models provide good accuracy for predicting the network behavior, while at the same time allowing for good parallel scaling performance. In the second part, we present an energy-efficient demand-response model to reduce HPC systems\u27 energy consumption during demand response periods. We propose HPC job scheduling and resource provisioning schemes to enable HPC system\u27s emergency demand response participation. In the final part, we propose an economic demand-response model to allow both HPC operator and HPC users to jointly reduce HPC system\u27s energy cost. Our proposed model allows the participation of HPC systems in economic demand-response programs through a contract-based rewarding scheme that can incentivize HPC users to participate in demand response