2,749 research outputs found

    A New Data Layout for Set Intersection on GPUs

    Set intersection is at the core of a variety of problems, e.g. frequent itemset mining and sparse Boolean matrix multiplication. It is well known that large speed gains can, for some computational problems, be obtained by using a graphics processing unit (GPU) as a massively parallel computing device. However, GPUs require highly regular control flow and memory access patterns, and for this reason previous GPU methods for intersecting sets have used a simple bitmap representation. This representation requires excessive space on sparse data sets. In this paper we present a novel data layout, "BatMap", that is particularly well suited for parallel processing and is compact even for sparse data. Frequent itemset mining is one of the most important applications of set intersection. As a case study on the potential of BatMaps we focus on frequent pair mining, which is a core special case of frequent itemset mining. The main finding is that our method is able to achieve speedups over both Apriori and FP-growth when the number of distinct items is large and the density of the problem instance is above 1%. Previous implementations of frequent itemset mining on GPUs have not been able to show speedups over the best single-threaded implementations. (A version of this paper appears in the Proceedings of IPDPS 2011.)
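
    For context, here is a minimal sketch of the baseline the abstract contrasts against, not of BatMap itself: the simple bitmap representation stores each set as a fixed-size bit vector over the item universe, so intersection reduces to a regular, branch-free word-wise AND, at the cost of space proportional to the universe size no matter how sparse the sets are. The function name is an illustrative assumption, not from the paper.

```cpp
// A minimal sketch of the simple bitmap representation: each set is a
// bit vector over the item universe, one bit per item, packed into
// 64-bit words. Intersection is a word-wise AND plus a population count,
// which is the kind of regular, branch-free work GPUs handle well.
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Returns the size of the intersection of two sets stored as bitmaps.
// Space is O(universe size) per set regardless of sparsity, which is
// exactly the overhead a compact layout like BatMap is designed to avoid.
std::size_t intersect_count(const std::vector<std::uint64_t>& a,
                            const std::vector<std::uint64_t>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        count += std::popcount(a[i] & b[i]);  // branch-free per-word step
    return count;
}
```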

    Introducing Parallelism to the Ranges TS

    The current interface provided by the C++17 parallel algorithms poses some limitations with respect to parallel data access and heterogeneous systems, such as personal computers and server nodes with GPUs, smartphones, and embedded system-on-chip devices. In this paper, we present a summary of why we believe the Ranges TS solves these problems and also improves both programmability and performance on heterogeneous platforms. The complete paper has been submitted to WG21 for consideration, and here we present a summary of the proposed changes alongside new performance results. To the best of our knowledge, this is the first paper presented to WG21 that unifies the Ranges TS with the parallel algorithms introduced in C++17. Although there are various points of intersection, we focus on the composability of functions and the benefit that this brings to accelerator devices via kernel fusion.
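
    The composability point can be illustrated with standard C++20 ranges, into which the Ranges TS evolved. Composed views describe a single fused traversal with no intermediate buffers, which is the property a heterogeneous backend can exploit to fuse stages into one GPU kernel rather than launching one kernel (and allocating one temporary) per algorithm. This is a minimal host-side sketch, not the paper's proposed interface.

```cpp
// Composability with C++20 ranges: three stages are composed lazily into
// one pipeline object; no intermediate containers are materialized, and
// the whole chain is evaluated in a single pass over the data.
#include <iostream>
#include <ranges>
#include <vector>

int main() {
    std::vector<int> data{1, 2, 3, 4, 5, 6, 7, 8};

    auto pipeline = data
        | std::views::transform([](int x) { return x * x; })   // square
        | std::views::filter([](int x) { return x % 2 == 0; }) // keep even
        | std::views::transform([](int x) { return x + 1; });  // shift

    for (int v : pipeline)
        std::cout << v << ' ';  // single traversal prints: 5 17 37 65
}
```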

    Evaluating the Performance of Vulkan GLSL Compute Shaders in Real-Time Ray-Traced Audio Propagation Through 3D Virtual Environments

    Real-time ray tracing is a growing area of interest with applications in audio processing. However, real-time audio processing comes with strict performance requirements, which parallel computing is often used to meet. As graphics processing units (GPUs) have become more powerful and programmable, general-purpose computing on graphics processing units (GPGPU) has turned GPUs into extremely powerful parallel processors, making them increasingly prevalent in the domain of audio processing through platforms such as CUDA. The aim of this research was to investigate the potential of GLSL compute shaders for real-time audio processing, specifically for real-time ray-tracing tasks. To do this, a number of GLSL compute shaders were created, along with a C++ Vulkan application with which to execute them. These shaders propagate audio through a virtual environment using ray tracing, and implement 3D space partitioning and ray intersection prediction in order to gauge the effectiveness of these optimisations for this task. Statistically significant results show that the GLSL compute shaders successfully propagated audio through a virtual environment, returning results to the host system in real time, within 30 milliseconds. However, highly detailed virtual environments prevented results from being returned in real time, indicating potential for future research and optimisation.
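
    As a sketch of the per-ray work such a compute shader performs, here is a host-side C++ version of the well-known Moller-Trumbore ray/triangle intersection test; the study's actual GLSL shaders, space-partitioning scheme, and audio propagation model are not reproduced here, and all names are illustrative assumptions.

```cpp
// Moller-Trumbore ray/triangle intersection: the core geometric test a
// ray-tracing compute shader runs per ray, per candidate triangle.
#include <array>
#include <cmath>
#include <optional>

using Vec3 = std::array<float, 3>;

static Vec3 sub(Vec3 a, Vec3 b) { return {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0]};
}
static float dot(Vec3 a, Vec3 b) { return a[0]*b[0]+a[1]*b[1]+a[2]*b[2]; }

// Returns the distance t along the ray at which it hits triangle
// (v0, v1, v2), or std::nullopt if the ray misses.
std::optional<float> ray_triangle(Vec3 orig, Vec3 dir,
                                  Vec3 v0, Vec3 v1, Vec3 v2) {
    const float eps = 1e-7f;
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < eps) return std::nullopt;  // ray parallel to plane
    float inv = 1.0f / det;
    Vec3 s = sub(orig, v0);
    float u = dot(s, p) * inv;                      // barycentric coord u
    if (u < 0.0f || u > 1.0f) return std::nullopt;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * inv;                    // barycentric coord v
    if (v < 0.0f || u + v > 1.0f) return std::nullopt;
    float t = dot(e2, q) * inv;                     // hit distance
    if (t > eps) return t;
    return std::nullopt;
}
```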

    From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high-performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the efficient use of large-scale CPU/GPU systems for complex applications without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms. (18 pages, 4 figures; accepted for publication in Scientific Programming.)
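
    One ingredient the abstract names, higher-order finite differences, can be sketched in a few lines: a fourth-order centered approximation of the second derivative on a uniform 1D grid. Chemora generates such stencils (across multi-block domains, and as GPU code) automatically; this hand-written version is only illustrative, and the function name is an assumption.

```cpp
// Fourth-order centered finite-difference approximation of d^2u/dx^2 on
// a uniform grid with spacing h, using the five-point stencil
// (-1, 16, -30, 16, -1) / (12 h^2). Interior points only; the boundary
// points would need one-sided stencils in a real code.
#include <vector>

std::vector<double> second_derivative_4th(const std::vector<double>& u,
                                          double h) {
    std::vector<double> d(u.size(), 0.0);
    const double c = 1.0 / (12.0 * h * h);
    for (std::size_t i = 2; i + 2 < u.size(); ++i)
        d[i] = c * (-u[i-2] + 16.0*u[i-1] - 30.0*u[i]
                    + 16.0*u[i+1] - u[i+2]);
    return d;
}
```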

    Efficient Yet Deep Convolutional Neural Networks for Semantic Segmentation

    Semantic segmentation using deep convolutional neural networks poses a complex challenge for any GPU-intensive task: computing millions of parameters results in huge memory consumption. Moreover, extracting finer features and conducting supervised training tend to increase the complexity. With the introduction of the fully convolutional neural network, which uses finer strides and deconvolutional layers for upsampling, it has become a go-to for image segmentation tasks. In this paper, we propose two segmentation architectures which not only need one-third of the parameters to compute but also give better accuracy than similar architectures. The model weights were transferred from popular neural networks like VGG19 and VGG16, which were trained on the ImageNet classification data-set. We then transform all the fully connected layers to convolutional layers and use dilated convolution to decrease the parameters. Lastly, we add finer strides and attach four skip architectures which are element-wise summed with the deconvolutional layers in steps. We train and test on different sparse and fine data-sets like Pascal VOC2012, Pascal-Context, and NYUDv2, and show how much better our model performs on these tasks. Our model also has a faster inference time and consumes less memory for training and testing on NVIDIA Pascal GPUs, making it a more efficient and less memory-consuming architecture for pixel-wise segmentation. (8 pages.)
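
    The parameter reduction rests on dilated convolution: with dilation rate d, a k-tap kernel covers a receptive field of (k-1)*d + 1 samples while still holding only k weights. Below is a minimal 1D sketch of the mechanism (the paper applies 2D dilated convolutions; the function name is an illustrative assumption).

```cpp
// Dilated 1D convolution ("valid" mode, no padding): the kernel taps are
// spaced d samples apart, so the receptive field grows with the dilation
// rate d while the number of learned weights stays fixed at w.size().
#include <vector>

std::vector<float> dilated_conv1d(const std::vector<float>& x,
                                  const std::vector<float>& w, int d) {
    const int k = static_cast<int>(w.size());
    const int span = (k - 1) * d;                 // receptive field minus one
    const int n = static_cast<int>(x.size()) - span;
    std::vector<float> y(n > 0 ? n : 0, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < k; ++j)
            y[i] += w[j] * x[i + j * d];          // taps d samples apart
    return y;
}
```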