2,116 research outputs found

    Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines

    Full text link
    In this paper, we address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x with respect to single core CPU executions for morphological reconstruction and euclidean distance transform, respectively.Comment: 37 pages, 16 figure

    Bolt: Accelerated Data Mining with Fast Vector Compression

    Full text link
    Vectors of data are at the heart of machine learning and data mining. Recently, vector quantization methods have shown great promise in reducing both the time and space costs of operating on vectors. We introduce a vector quantization algorithm that can compress vectors over 12x faster than existing techniques while also accelerating approximate vector operations such as distance and dot product computations by up to 10x. Because it can encode over 2GB of vectors per second, it makes vector quantization cheap enough to employ in many more circumstances. For example, using our technique to compute approximate dot products in a nested loop can multiply matrices faster than a state-of-the-art BLAS implementation, even when our algorithm must first compress the matrices. In addition to showing the above speedups, we demonstrate that our approach can accelerate nearest neighbor search and maximum inner product search by over 100x compared to floating point operations and up to 10x compared to other vector quantization methods. Our approximate Euclidean distance and dot product computations are not only faster than those of related algorithms with slower encodings, but also faster than Hamming distance computations, which have direct hardware support on the tested platforms. We also assess the errors of our algorithm's approximate distances and dot products, and find that it is competitive with existing, slower vector quantization algorithms.Comment: Research track paper at KDD 201

    A Similarity Measure for GPU Kernel Subgraph Matching

    Full text link
    Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across various kernel's CFGs to gain insights to an application's resource requirements, based on the shape and traversal of the graph, instruction operations executed and registers allocated, among other information. The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that facilitates end users, autotuners and compilers in generating high performing code

    GPU-based fast gamma index calcuation

    Full text link
    The gamma-index dose comparison tool has been widely used to compare dose distributions in cancer radiotherapy. The accurate calculation of gamma-index requires an exhaustive search of the closest Euclidean distance in the high-resolution dose-distance space. This is a computational intensive task when dealing with 3D dose distributions. In this work, we combine a geometric method with a radial pre-sorting technique , and implement them on computer graphics processing units (GPUs). The developed GPU-based gamma-index computational tool is evaluated on eight pairs of IMRT dose distributions. The GPU implementation achieved 20x~30x speedup factor compared to CPU implementation and gamma-index calculations can be finished within a few seconds for all 3D testing cases. We further investigated the effect of various factors on both CPU and GPU computation time. The strategy of pre-sorting voxels based on their dose difference values speed up the GPU calculation by about 2-4 times. For n-dimensional dose distributions, gamma-index calculation time on CPU is proportional to the summation of gamma^n over all voxels, while that on GPU is effected by gamma^n distributions and is approximately proportional to the gamma^n summation over all voxels. We found increasing dose distributions resolution leads to quadratic increase of computation time on CPU, while less-than-quadratic increase on GPU. The values of dose difference (DD) and distance-to-agreement (DTA) criteria also have their impact on gamma-index calculation time.Comment: 13 pages, 2 figures, and 3 table

    Graph Edge Bundling by Medial Axes

    Get PDF
    We present a new method for bundling edges of general graphs, based on 2D medial axes of edge sets which are similar in terms of position. We combine edge clustering, distance fields, and 2D medial axes to progressively bundle general graphs by attract-ing edges towards the centerlines of level sets of their distance fields. Our method allows for an efficient GPU implementation. We illustrate our method on several large real-world graphs
    • …
    corecore