Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This,
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK, poses additional design constraints due
to the very large message communication of GPU buffers during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK. Comment: 8 pages, 3 figure
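As a rough illustration of the pipelined chain idea named in the abstract above, the sketch below simulates a broadcast in which the root splits its message into chunks and every rank forwards each chunk to its successor as soon as it has it. The function name `pipelined_chain_bcast`, the unit-cost step model, and the in-memory buffers are assumptions made for illustration; this is not the MVAPICH2-GDR implementation.

```python
def pipelined_chain_bcast(num_ranks, chunks):
    """Simulate a pipelined chain broadcast rooted at rank 0.

    Hypothetical model: in one step, each rank may send one chunk to
    its successor in the chain. Returns (per-rank buffers, step count).
    """
    buffers = [[] for _ in range(num_ranks)]
    buffers[0] = list(chunks)  # the root starts with the full message
    steps = 0
    while any(len(b) < len(chunks) for b in buffers):
        # Plan all transfers from the state at the start of the step,
        # so a chunk cannot travel more than one hop per step.
        transfers = []
        for rank in range(num_ranks - 1):
            needed = len(buffers[rank + 1])  # next chunk index the successor lacks
            if needed < len(buffers[rank]):
                transfers.append((rank + 1, buffers[rank][needed]))
        for dst, chunk in transfers:
            buffers[dst].append(chunk)
        steps += 1
    return buffers, steps


if __name__ == "__main__":
    bufs, steps = pipelined_chain_bcast(4, ["c0", "c1", "c2"])
    print(steps, bufs[-1])
```

Under this model, 4 ranks and 3 chunks complete in 5 steps (chunks + ranks - 2), whereas sending the message as a single chunk takes 3 steps that each move the whole message. The overlap between hops is what makes chunking attractive for very large GPU buffers.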
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
The diversity of workload requirements and increasing hardware heterogeneity
in emerging high performance computing (HPC) systems motivate resource
disaggregation. Resource disaggregation allows compute and memory resources to
be allocated individually as required to each workload. However, it is unclear
how to efficiently realize this capability and cost-effectively meet the
stringent bandwidth and latency requirements of HPC applications. To that end,
we describe how modern photonics can be co-designed with modern HPC racks to
implement flexible intra-rack resource disaggregation and fully meet the bit
error rate (BER) and high escape bandwidth of all chip types in modern HPC
racks. Our photonic-based disaggregated rack provides an average application
speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared
to a similar system that instead uses modern electronic switches for
disaggregation. Using observed resource usage from a production system, we
estimate that an iso-performance intra-rack disaggregated HPC system using
photonics would require 4x fewer memory modules and 2x fewer NICs than a
non-disaggregated baseline. Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 202
XSEDE: eXtreme Science and Engineering Discovery Environment Third Quarter 2012 Report
The Extreme Science and Engineering Discovery Environment (XSEDE) is the most advanced, powerful, and robust collection of integrated digital resources and services in the world. It is an integrated cyberinfrastructure ecosystem with singular interfaces for allocations, support, and other key services that researchers can use to interactively share computing resources, data, and expertise. This is a report of project activities and highlights from the third quarter of 2012. National Science Foundation, OCI-105357
Mapping applications onto FPGA-centric clusters
High Performance Computing (HPC) is becoming increasingly important throughout science and engineering as ever more complex problems must be solved through computational simulations. In these large computational applications, the latency of communication between processing nodes is often the key factor that limits performance. An emerging alternative computer architecture that addresses the latency problem is the FPGA-centric cluster (FCC); in these systems, the devices (FPGAs) are directly interconnected and thus many layers of hardware and software are avoided. The result can be scalability not currently achievable with other technologies.
In FCCs, FPGAs serve multiple functions: accelerator, network interface card (NIC), and router. Moreover, because FPGAs are configurable, there is substantial opportunity to tailor the router hardware to the application; previous work has demonstrated that such application-aware configuration can effect a substantial improvement in hardware efficiency. One constraint of FCCs is that it is convenient for their interconnect to be static, direct, and have a two or three dimensional mesh topology. Thus, applications that are naturally of a different dimensionality (have a different logical topology) from that of the FCC must be remapped to obtain optimal performance.
In this thesis we study various aspects of the mapping problem for FCCs. There are two major research thrusts. The first is finding the optimal mapping of logical to physical topology. This problem has received substantial attention from both the theory community, where topology mapping is referred to as graph embedding, and the High Performance Computing (HPC) community, where it is a question of process placement. We explore the implications of the different mapping strategies on communication behavior in FCCs, especially on resulting load imbalance.
The second major research thrust is built around the hypothesis that applications that need to be remapped (due to differing logical and physical topologies) will have different optimal router configurations from those applications that do not. For example, due to remapping, some virtual or physical communication links may have little occupancy; therefore fewer resources should be allocated to them. Critical here is the creation of a new set of parameterized hardware features that can be configured to best handle load imbalances caused by remapping. These two thrusts form a codesign loop: certain mapping algorithms may be differentially optimal due to application-aware router reconfiguration that accounts for this mapping.
This thesis has four parts. The first part introduces the background and previous work related to communication in general and, in particular, how it is implemented in FCCs. We build on previous work on application-aware router configuration. The second part introduces topology mapping mechanisms including those derived from graph embeddings and a greedy algorithm commonly used in HPC. In the third part, topology mappings are evaluated for performance and imbalance; we note that different mapping strategies lead to different imbalances both in the overall network and in each node. The final part introduces a reconfigurable router design that allocates resources based on the different imbalance situations caused by different mapping behaviors.
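To make the notion of mapping-induced load imbalance concrete, the following minimal sketch counts how many logical edges cross each physical link when a logical topology is placed on a 2D mesh. Dimension-order (X then Y) routing, the helper names `xy_route` and `link_loads`, and the two example placements are all assumptions for illustration; the thesis abstract does not specify its routing or evaluation method.

```python
from collections import Counter


def xy_route(src, dst):
    """Directed links used by X-then-Y routing between two mesh coordinates."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:  # travel along X first
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:  # then along Y
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def link_loads(mapping, logical_edges):
    """Count how many logical edges traverse each physical link."""
    loads = Counter()
    for u, v in logical_edges:
        for link in xy_route(mapping[u], mapping[v]):
            loads[link] += 1
    return loads


if __name__ == "__main__":
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]  # logical 4-process ring
    row_major = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
    snake = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
    for name, placement in (("row-major", row_major), ("snake", snake)):
        loads = link_loads(placement, ring)
        print(name, "total hops:", sum(loads.values()), "max load:", max(loads.values()))
```

On a 2x2 mesh the row-major placement needs 6 link traversals for the ring while the snake placement needs 4, which shows how the choice of embedding changes link occupancy even for a tiny example.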
Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks
In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing has started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable due to irregular topologies, which are either irregular by design or, most often, the result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware and software management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables.
Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions.
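For readers unfamiliar with the channel dependency graph, the sketch below builds one from a set of fixed routes and tests it for cycles, the classic condition under which a routing on a lossless network can deadlock. It illustrates only this conventional after-the-fact check, not the thesis's approach of computing paths directly on the graph; the function names and the representation of routes as node lists are assumptions.

```python
def channel_dependency_graph(routes):
    """Build a channel dependency graph from routes given as node sequences.

    A channel is a directed link (u, v). Channel a depends on channel b
    when some route uses b immediately after a.
    """
    cdg = {}
    for path in routes:
        channels = list(zip(path, path[1:]))
        for ch in channels:
            cdg.setdefault(ch, set())
        for a, b in zip(channels, channels[1:]):
            cdg[a].add(b)
    return cdg


def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph (dict of sets)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge closes a cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


if __name__ == "__main__":
    # Two-hop clockwise routes on a 4-node ring create a cyclic dependency.
    ring_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
    print(has_cycle(channel_dependency_graph(ring_routes)))
    # Dropping the routes that wrap around removes the cycle.
    print(has_cycle(channel_dependency_graph([[0, 1, 2], [1, 2, 3]])))
```

In practice, cycles found this way are typically broken by assigning routes to different virtual channels, which is why a bound on the number of virtual channels matters.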
Satellite Networks: Architectures, Applications, and Technologies
Since global satellite networks are moving to the forefront in enhancing the national and global information infrastructures due to communication satellites' unique networking characteristics, a workshop was organized to assess the progress made to date and chart the future. This workshop provided a forum to assess the current state of the art, identify key issues, and highlight emerging trends in next-generation architectures, data protocol development, communication interoperability, and applications. Presentations on overviews, the state of the art in research, development, deployment, and applications, and future trends in satellite networks are assembled.
Hardware Support for Efficient Packet Processing
Scalability is the key ingredient to further increase the performance of today’s supercomputers.
As other approaches like frequency scaling reach their limits, parallelization is the
only feasible way to further improve the performance. The time required for communication
needs to be kept as small as possible to increase the scalability, in order to be able to
further parallelize such systems.
In the first part of this thesis, ways to reduce the inflicted latency in packet based interconnection
networks are analyzed and several new architectural solutions are proposed to
solve these issues. These solutions have been tested and proven in a field programmable
gate array (FPGA) environment. In addition, a hardware (HW) structure is presented that
enables low latency packet processing for financial markets.
The second part and the main contribution of this thesis is the newly designed crossbar
architecture. It introduces a novel way to integrate the ability to multicast in a crossbar
design. Furthermore, an efficient implementation of adaptive routing to reduce the
congestion vulnerability in packet based interconnection networks is shown. The low
latency of the design is demonstrated through simulation and its scalability is proven with
synthesis results.
The third part concentrates on the improvements and modifications made to EXTOLL, a
high performance interconnection network specifically designed for low latency and high
throughput applications. Contributions are modules enabling an efficient integration of
multiple host interfaces as well as the integration of the on-chip interconnect. Additionally,
some of the already existing functionality has been revised and improved to reach better
performance and a lower latency. Micro-benchmark results are presented to underline the
contribution of these modifications.
High-Performance MPI Broadcast Algorithm for Grid Environments Utilizing Multi-lane NICs
The performance of MPI collective operations, such as broadcast and reduction, is heavily affected by network topologies, especially in grid environments. Many techniques to construct efficient broadcast trees have been proposed for grids. On the other hand, recent high performance computing nodes are often equipped with multi-lane network interface cards (NICs), which most previous collective communication methods fail to harness effectively. Our new broadcast algorithm for grid environments harnesses almost all downward and upward bandwidths of multi-lane NICs: a message to be broadcast is split into two pieces, which are broadcast along two independent binary trees in a pipelined fashion and swapped between both trees. The salient feature of our algorithm is generality; it works effectively on both large clusters and grid environments. It can also be applied to nodes with a single NIC by making multiple sockets share the NIC. Experiments on an emulated network environment show that we achieve higher performance than traditional methods, regardless of network topology or message size.