3,598 research outputs found
Scalable Breadth-First Search on a GPU Cluster
On a GPU cluster, the ratio of high computing power to communication
bandwidth makes scaling breadth-first search (BFS) on a scale-free graph
extremely challenging. By separating high and low out-degree vertices, we
present an implementation with scalable computation and a model for scalable
communication for BFS and direction-optimized BFS. Our communication model uses
global reduction for high-degree vertices, and point-to-point transmission for
low-degree vertices. Leveraging the characteristics of degree separation, we
reduce the graph size to one third of the conventional edge list
representation. With several other optimizations, we observe linear weak
scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a
scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access
system.Comment: 12 pages, 13 figures. To appear at IPDPS 201
The Iray Light Transport Simulation and Rendering System
While ray tracing has become increasingly common and path tracing is well
understood by now, a major challenge lies in crafting an easy-to-use and
efficient system implementing these technologies. Following a purely
physically-based paradigm while still allowing for artistic workflows, the Iray
light transport simulation and rendering system allows for rendering complex
scenes by the push of a button and thus makes accurate light transport
simulation widely available. In this document we discuss the challenges and
implementation choices that follow from our primary design decisions,
demonstrating that such a rendering system can be made a practical, scalable,
and efficient real-world application that has been adopted by various companies
across many fields and is in use by many industry professionals today
TensorFlow Doing HPC
TensorFlow is a popular emerging open-source programming framework supporting
the execution of distributed applications on heterogeneous hardware. While
TensorFlow has been initially designed for developing Machine Learning (ML)
applications, in fact TensorFlow aims at supporting the development of a much
broader range of application kinds that are outside the ML domain and can
possibly include HPC applications. However, very few experiments have been
conducted to evaluate TensorFlow performance when running HPC workloads on
supercomputers. This work addresses this lack by designing four traditional HPC
benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG)
solver and Fast Fourier Transform (FFT). We analyze their performance on two
supercomputers with accelerators and evaluate the potential of TensorFlow for
developing HPC applications. Our tests show that TensorFlow can fully take
advantage of high performance networks and accelerators on supercomputers.
Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical
communication bandwidth on our testing platform. We find an approximately 2x,
1.7x and 1.8x performance improvement when increasing the number of GPUs from
two to four in the matrix-matrix multiply, CG and FFT applications
respectively. All our performance results demonstrate that TensorFlow has high
potential of emerging also as HPC programming framework for heterogeneous
supercomputers.Comment: Accepted for publication at The Ninth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES'19
SHADHO: Massively Scalable Hardware-Aware Distributed Hyperparameter Optimization
Computer vision is experiencing an AI renaissance, in which machine learning
models are expediting important breakthroughs in academic research and
commercial applications. Effectively training these models, however, is not
trivial due in part to hyperparameters: user-configured values that control a
model's ability to learn from data. Existing hyperparameter optimization
methods are highly parallel but make no effort to balance the search across
heterogeneous hardware or to prioritize searching high-impact spaces. In this
paper, we introduce a framework for massively Scalable Hardware-Aware
Distributed Hyperparameter Optimization (SHADHO). Our framework calculates the
relative complexity of each search space and monitors performance on the
learning task over all trials. These metrics are then used as heuristics to
assign hyperparameters to distributed workers based on their hardware. We first
demonstrate that our framework achieves double the throughput of a standard
distributed hyperparameter optimization framework by optimizing SVM for MNIST
using 150 distributed workers. We then conduct model search with SHADHO over
the course of one week using 74 GPUs across two compute clusters to optimize
U-Net for a cell segmentation task, discovering 515 models that achieve a lower
validation loss than standard U-Net.Comment: 10 pages, 6 figure
Gunrock: A High-Performance Graph Processing Library on the GPU
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs have been two
significant challenges for developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We evaluate Gunrock on five key graph
primitives and show that Gunrock has on average at least an order of magnitude
speedup over Boost and PowerGraph, comparable performance to the fastest GPU
hardwired primitives, and better performance than any other GPU high-level
graph library.Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the
previous version v5
Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines
In this paper, we address the problem of efficient execution of a computation
pattern, referred to here as the irregular wavefront propagation pattern
(IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in
several image processing operations. In the IWPP, data elements in the
wavefront propagate waves to their neighboring elements on a grid if a
propagation condition is satisfied. Elements receiving the propagated waves
become part of the wavefront. This pattern results in irregular data accesses
and computations. We develop and evaluate strategies for efficient computation
and propagation of wavefronts using a multi-level queue structure. This queue
structure improves the utilization of fast memories in a GPU and reduces
synchronization overheads. We also develop a tile-based parallelization
strategy to support execution on multiple CPUs and GPUs. We evaluate our
approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs
and 2 multicore CPUs) using the IWPP implementations of two widely used image
processing operations: morphological reconstruction and euclidean distance
transform. Our results show significant performance improvements on GPUs. The
use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x
with respect to single core CPU executions for morphological reconstruction and
euclidean distance transform, respectively.Comment: 37 pages, 16 figure
- …