4,599 research outputs found

    Improving Software Performance in the Compute Unified Device Architecture

    This paper analyzes several aspects of improving software performance for applications written in the Compute Unified Device Architecture (CUDA). We address an issue of great importance when programming a CUDA application: the Graphics Processing Unit’s (GPU’s) memory management through transpose kernels. We also benchmark and evaluate the performance of a progressively optimized matrix transpose application in CUDA. Of particular interest was how well the optimization techniques applied to software written in CUDA scale to the latest generation of general-purpose graphics processing units (GPGPUs), such as the Fermi architecture implemented in the GTX480, compared with the previous architecture implemented in the GTX280. There has recently been a lot of interest in the literature in this type of optimization analysis, but to the best of our knowledge none of the works so far has tried to validate whether the optimizations apply to a GPU from the latest Fermi architecture and how well the Fermi architecture scales with these performance-improving techniques. Keywords: Compute Unified Device Architecture, Fermi Architecture, Naive Transpose, Coalesced Transpose, Shared Memory Copy, Loop in Kernel, Loop over Kernel
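    The shared-memory transpose idea named in the keywords above can be illustrated on the CPU. The following is a minimal Python sketch, not the paper's CUDA kernels: the function name `transpose_tiled`, the `tile` parameter, and the `scratch` buffer (a stand-in for GPU shared memory) are all assumptions made for illustration.

```python
def transpose_tiled(matrix, tile=4):
    """Transpose a rectangular list-of-lists one tile at a time.

    Hypothetical CPU analogue of a tiled GPU transpose: each tile is first
    copied row by row into a scratch buffer, then written out row by row,
    so both the read and the write touch contiguous row segments.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            r1 = min(r0 + tile, rows)
            c1 = min(c0 + tile, cols)
            # Stage the tile: contiguous reads from the input rows.
            scratch = [matrix[r][c0:c1] for r in range(r0, r1)]
            # Emit the transposed tile: contiguous writes to the output rows.
            for c in range(c0, c1):
                out[c][r0:r1] = [scratch[r - r0][c - c0] for r in range(r0, r1)]
    return out


if __name__ == "__main__":
    sample = [[1, 2, 3], [4, 5, 6]]
    print(transpose_tiled(sample, tile=2))
```

    On a GPU the staging step is what would let both the global-memory read and the global-memory write be coalesced; on a CPU the sketch only shows the access pattern, not the speedup.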

    Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture

    String matching algorithms are among the most widely used algorithms in computer science. Improving the efficiency of the underlying string matching algorithm will greatly increase the efficiency of any application that uses it. In recent years, graphics processing units have emerged as highly parallel processors. They outperform the best central processing units in scientific computation power. Combining recent advances in graphics processing units with string matching algorithms makes it possible to speed up string matching. In this paper we propose a modified parallel version of the Rabin-Karp algorithm using a graphics processing unit. The results of the CPU and parallel GPU implementations are compared to evaluate the effect of varying the number of threads, cores, file size, and pattern size. Comment: Information and Communication Technology for Intelligent Systems (ICTIS 2017)
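    For readers unfamiliar with Rabin-Karp, the following is a minimal sketch of the standard sequential algorithm in Python. It is not the modified parallel version the paper proposes; the function name `rabin_karp` and the `base` and `mod` defaults are assumptions chosen for illustration.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Return every index at which `pattern` occurs in `text`.

    Textbook sequential Rabin-Karp with a rolling hash; `base` and `mod`
    are illustrative choices, not values taken from the paper.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for s in range(n - m + 1):
        # Compare characters only when the hashes agree, to rule out collisions.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # Roll the window: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base + ord(text[s + m])) % mod
    return matches


if __name__ == "__main__":
    print(rabin_karp("abracadabra", "abra"))
```

    One common way to parallelize this, stated here only as a general observation, is to give each thread its own slice of the text (overlapping by the pattern length minus one) and run the same window check independently.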

    Super Calculator using Compute Unified Device Architecture (CUDA)

    Scientific computation requires a great amount of computing power, especially for floating-point operations, but high-end multi-core processors are currently limited in floating-point performance and parallelization. Recent technological advances have made parallel computing technically and financially feasible using the Compute Unified Device Architecture (CUDA) developed by NVIDIA. This research focuses on measuring the performance of CUDA and on implementing CUDA for scientific computation by porting source code from CPU to GPU using a direct integration technique. The ported source code is then optimized by managing resources to achieve a performance gain over the CPU. CUDA is found to boost system performance by up to 69 times in the Parboil Benchmark Suite. Successful ports of the Serpent encryption algorithm and the Lattice Boltzmann Method provided up to a 7-fold throughput gain and up to a 10-fold execution-time gain, respectively, over the CPU. A direct integration guideline for porting source code is then produced based on the two implementations.

    Parallelising wavefront applications on general-purpose GPU devices

    Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups.

    A GPU-based Evolution Strategy for Optic Disk Detection in Retinal Images

    Parallel processing using graphics processing units (GPUs) has attracted much research interest in recent years. Parallel computation can be applied to evolution strategies (ES) to process the individuals in a population; however, evolution strategies consume significant computational resources when solving large problems or problems modeled with complex fitness functions. In this paper we describe the implementation of an improved ES for optic disk detection in retinal images using the Compute Unified Device Architecture (CUDA) environment. The experimental results show that optic disk detection achieves a speedup of 5x to 7x compared with a sequential implementation on a mainstream CPU.

    Solutions for Optimizing the Data Parallel Prefix Sum Algorithm Using the Compute Unified Device Architecture

    In this paper, we analyze solutions for optimizing the data parallel prefix sum function using the Compute Unified Device Architecture (CUDA), which provides a viable solution for accelerating a broad class of applications. The parallel prefix sum function is an essential building block for many data mining algorithms, and therefore its optimization facilitates the whole data mining process. Finally, we benchmark and evaluate the performance of the optimized parallel prefix sum building block in CUDA. Keywords: CUDA, threads, GPGPU, parallel prefix sum, parallel processing, task synchronization, warp
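    To make the prefix sum building block concrete, here is a minimal Python sketch of a step-doubling (Hillis-Steele style) inclusive scan. It simulates the data-parallel passes sequentially and is not the optimized CUDA implementation the paper benchmarks; the function name `inclusive_scan` is an assumption.

```python
def inclusive_scan(values):
    """Inclusive prefix sum computed in about log2(n) data-parallel passes.

    Each pass builds a new buffer from the old one, which mirrors the
    double buffering a GPU version would need so that threads do not read
    values already overwritten in the same pass.
    """
    data = list(values)
    offset = 1
    while offset < len(data):
        # On a GPU, every index i would be handled by its own thread.
        data = [
            data[i] + data[i - offset] if i >= offset else data[i]
            for i in range(len(data))
        ]
        offset *= 2
    return data


if __name__ == "__main__":
    print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
```

    This variant performs more additions than a sequential loop (on the order of n log n instead of n) in exchange for a short chain of dependent steps; work-efficient scans reduce the extra additions at the cost of a more involved two-phase structure.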

    Parallel processing in Compute Unified Device Architecture (CUDA) for energy saving glass

    Energy saving glass is used to retain warmth inside a building, instead of running a thermal radiator or heating machine continuously during winter. However, the coating structure of the glass usually has a regular shape, such as tripoles and circles, of very small size, which limits the passage of useful signals such as wireless and radio frequency signals. One way to allow more useful signals to pass through the glass is to use a complex-shaped coating structure. In this research, a genetic algorithm is used to develop such a complex coating structure. However, a genetic algorithm requires fast processing to cope with creating new chromosomes from the population and performing the selection, crossover, and mutation operations. Hence, parallel processing using both the CPU and the GPU is needed, which eliminates the need to purchase a high-performance CPU and to add repeaters to boost the wireless signals. The coating structure is represented as binary bits in a text file that shows the best chromosome. The result is analyzed in a simulation tool that checks the signal transmission efficiency and rate loss.