Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines
In this paper, we address the problem of efficient execution of a computation
pattern, referred to here as the irregular wavefront propagation pattern
(IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in
several image processing operations. In the IWPP, data elements in the
wavefront propagate waves to their neighboring elements on a grid if a
propagation condition is satisfied. Elements receiving the propagated waves
become part of the wavefront. This pattern results in irregular data accesses
and computations. We develop and evaluate strategies for efficient computation
and propagation of wavefronts using a multi-level queue structure. This queue
structure improves the utilization of fast memories in a GPU and reduces
synchronization overheads. We also develop a tile-based parallelization
strategy to support execution on multiple CPUs and GPUs. We evaluate our
approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs
and 2 multicore CPUs) using the IWPP implementations of two widely used image
processing operations: morphological reconstruction and Euclidean distance
transform. Our results show significant performance improvements on GPUs. The
use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x
with respect to single core CPU executions for morphological reconstruction and
Euclidean distance transform, respectively.
Comment: 37 pages, 16 figures
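The propagation loop described above can be illustrated with a small sequential sketch. The following toy Python version of queue-based grayscale morphological reconstruction shows the IWPP structure (a pixel whose value changes joins the wavefront queue); it is a single-threaded illustration, not the paper's multi-level-queue GPU implementation, and the function name and 4-neighborhood are our assumptions:

```python
from collections import deque
import numpy as np

def morphological_reconstruction(marker, mask):
    """Queue-based grayscale reconstruction by dilation (requires marker <= mask).

    Sequential sketch of the irregular wavefront propagation pattern:
    elements whose value is raised become part of the wavefront (the queue).
    """
    marker = marker.copy()
    h, w = marker.shape
    # seed the wavefront with every pixel
    queue = deque((y, x) for y in range(h) for x in range(w))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                # propagation condition: the neighbor can still be raised
                new_val = min(marker[y, x], mask[ny, nx])
                if new_val > marker[ny, nx]:
                    marker[ny, nx] = new_val
                    queue.append((ny, nx))  # neighbor joins the wavefront
    return marker
```

The irregularity is visible here: which pixels enter the queue depends entirely on the data, which is what makes the pattern hard to map efficiently onto SIMD hardware.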
Bolt: Accelerated Data Mining with Fast Vector Compression
Vectors of data are at the heart of machine learning and data mining.
Recently, vector quantization methods have shown great promise in reducing both
the time and space costs of operating on vectors. We introduce a vector
quantization algorithm that can compress vectors over 12x faster than existing
techniques while also accelerating approximate vector operations such as
distance and dot product computations by up to 10x. Because it can encode over
2GB of vectors per second, it makes vector quantization cheap enough to employ
in many more circumstances. For example, using our technique to compute
approximate dot products in a nested loop can multiply matrices faster than a
state-of-the-art BLAS implementation, even when our algorithm must first
compress the matrices.
In addition to showing the above speedups, we demonstrate that our approach
can accelerate nearest neighbor search and maximum inner product search by over
100x compared to floating point operations and up to 10x compared to other
vector quantization methods. Our approximate Euclidean distance and dot product
computations are not only faster than those of related algorithms with slower
encodings, but also faster than Hamming distance computations, which have
direct hardware support on the tested platforms. We also assess the errors of
our algorithm's approximate distances and dot products, and find that it is
competitive with existing, slower vector quantization algorithms.
Comment: Research track paper at KDD 201
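As a rough illustration of how table-based approximate dot products work, here is a minimal product-quantization sketch in NumPy. This is not Bolt's encoder — the function names, the simple k-means training loop, and the codebook sizes are our assumptions — it only shows the split/encode/lookup structure that such methods share:

```python
import numpy as np

def build_codebooks(X, n_subspaces, n_codes=16, n_iter=10, seed=0):
    """Learn one small codebook per subspace with a few k-means iterations."""
    rng = np.random.default_rng(seed)
    books = []
    for S in np.split(X, n_subspaces, axis=1):
        centers = S[rng.choice(len(S), n_codes, replace=False)]
        for _ in range(n_iter):
            dists = ((S[:, None, :] - centers[None]) ** 2).sum(-1)
            assign = dists.argmin(1)
            for k in range(n_codes):
                pts = S[assign == k]
                if len(pts):
                    centers[k] = pts.mean(0)
        books.append(centers)
    return books

def encode(X, books):
    """Replace each subvector by the index of its nearest centroid."""
    subs = np.split(X, len(books), axis=1)
    codes = [((S[:, None, :] - B[None]) ** 2).sum(-1).argmin(1)
             for S, B in zip(subs, books)]
    return np.stack(codes, axis=1)          # shape (n, n_subspaces)

def approx_dot(query, codes, books):
    """Approximate dot products via table lookups: precompute
    query-subvector . centroid once per subspace, then sum lookups."""
    qsubs = np.split(query, len(books))
    tables = [B @ q for q, B in zip(qsubs, books)]   # (n_codes,) each
    return sum(T[codes[:, j]] for j, T in enumerate(tables))
```

The speed comes from replacing a length-d multiply-add per vector with a handful of table lookups; Bolt's contribution, per the abstract, is making both the encoding and the lookups much faster than prior schemes.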
A Similarity Measure for GPU Kernel Subgraph Matching
Accelerator architectures specialize in executing SIMD (single instruction,
multiple data) in lockstep. Because the majority of CUDA applications are
parallelized loops, control flow information can provide an in-depth
characterization of a kernel. CUDAflow is a tool that statically separates CUDA
binaries into basic block regions and dynamically measures instruction and
basic block frequencies. CUDAflow captures this information in a control flow
graph (CFG) and performs subgraph matching across various kernels' CFGs to gain
insights into an application's resource requirements, based on the shape and
traversal of the graph, instruction operations executed and registers
allocated, among other information. The utility of CUDAflow is demonstrated
with SHOC and Rodinia application case studies on a variety of GPU
architectures, revealing novel thread divergence characteristics that
facilitate end users, autotuners, and compilers in generating high-performing
code.
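The abstract does not specify CUDAflow's matching algorithm, but to make the idea of comparing kernels by their CFGs concrete, a deliberately simple similarity over canonicalized edge sets might look like the following (the edge representation and Jaccard measure are our assumptions, not CUDAflow's):

```python
def cfg_similarity(edges_a, edges_b):
    """Jaccard similarity of two CFG edge sets: a toy stand-in for
    CUDAflow's subgraph matching. Assumes each CFG is given as
    (src_block, dst_block) pairs with canonical block labels."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0                      # two empty graphs match trivially
    return len(a & b) / len(a | b)
```

A real tool would also weight edges by dynamic frequency and compare instruction mixes and register counts, as the abstract describes, rather than structure alone.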
GPU-based fast gamma index calculation
The gamma-index dose comparison tool has been widely used to compare dose
distributions in cancer radiotherapy. The accurate calculation of gamma-index
requires an exhaustive search of the closest Euclidean distance in the
high-resolution dose-distance space. This is a computationally intensive task
when dealing with 3D dose distributions. In this work, we combine a geometric
method with a radial pre-sorting technique, and implement them on computer
graphics processing units (GPUs). The developed GPU-based gamma-index
computational tool is evaluated on eight pairs of IMRT dose distributions. The
GPU implementation achieved a 20x-30x speedup over the CPU implementation, and
gamma-index calculations can be finished within a few
seconds for all 3D testing cases. We further investigated the effect of various
factors on both CPU and GPU computation time. The strategy of pre-sorting
voxels based on their dose difference values speeds up the GPU calculation by
about 2-4 times. For n-dimensional dose distributions, gamma-index calculation
time on CPU is proportional to the summation of gamma^n over all voxels, while
that on GPU is affected by the gamma^n distribution and is approximately
proportional to the gamma^n summation over all voxels. We found that increasing
the dose distribution resolution leads to a quadratic increase of computation
time on CPU, but a less-than-quadratic increase on GPU. The values of dose difference
(DD) and distance-to-agreement (DTA) criteria also have their impact on
gamma-index calculation time.
Comment: 13 pages, 2 figures, and 3 tables
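For reference, the gamma index minimizes a generalized distance over the combined dose-distance space. A brute-force 1-D CPU sketch of that definition is below — this is exactly the exhaustive search the paper's geometric pruning and pre-sorting are designed to avoid, shown here only to make the quantity concrete (uniform voxel spacing and the function name are our assumptions):

```python
import numpy as np

def gamma_index(dose_ref, dose_eval, spacing, dd, dta):
    """Brute-force gamma index for 1-D dose profiles.

    dd:  dose-difference criterion (same units as dose)
    dta: distance-to-agreement criterion (same units as spacing)
    For each reference point, search every evaluated point for the
    minimum generalized distance in normalized dose-distance space.
    """
    n = len(dose_ref)
    x = np.arange(n) * spacing
    # pairwise normalized squared distances: rows = reference, cols = evaluated
    dist2 = ((x[:, None] - x[None, :]) / dta) ** 2
    dose2 = ((dose_eval[None, :] - dose_ref[:, None]) / dd) ** 2
    return np.sqrt((dist2 + dose2).min(axis=1))
```

A gamma value at or below 1 at a voxel means some evaluated point passes the combined DD/DTA criteria for it; the n-dimensional case replaces the 1-D positions with voxel coordinates, which is where the search cost explodes.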
Graph Edge Bundling by Medial Axes
We present a new method for bundling edges of general graphs, based on 2D medial axes of edge sets which are similar in terms of position. We combine edge clustering, distance fields, and 2D medial axes to progressively bundle general graphs by attracting edges towards the centerlines of level sets of their distance fields. Our method allows for an efficient GPU implementation. We illustrate our method on several large real-world graphs.
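A crude single-iteration sketch of the attraction idea: assuming each edge in a cluster is sampled as a fixed-length polyline, we pull interior sample points toward the per-index cluster mean, a cheap stand-in for the medial-axis centerline the paper actually computes from distance fields:

```python
import numpy as np

def bundle_step(polylines, strength=0.5):
    """One attraction iteration of a naive bundling sketch.

    polylines: list of (n_samples, 2) arrays, one per edge in a cluster.
    Interior points move a fraction `strength` toward the cluster's
    per-index mean curve; endpoints (the graph nodes) stay fixed.
    """
    stack = np.stack(polylines)           # (n_edges, n_samples, 2)
    center = stack.mean(axis=0)           # approximate centerline
    moved = stack + strength * (center - stack)
    moved[:, 0] = stack[:, 0]             # pin start nodes
    moved[:, -1] = stack[:, -1]           # pin end nodes
    return [m for m in moved]
```

Iterating such a step (with resampling and, in the paper's method, a distance-field-derived centerline rather than a plain mean) progressively tightens each cluster into a bundle.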