3,450 research outputs found

    SOLUTIONS FOR OPTIMIZING THE DATA PARALLEL PREFIX SUM ALGORITHM USING THE COMPUTE UNIFIED DEVICE ARCHITECTURE

    Get PDF
    In this paper, we analyze solutions for optimizing the data parallel prefix sum function using the Compute Unified Device Architecture (CUDA) that provides a viable solution for accelerating a broad class of applications. The parallel prefix sum function is an essential building block for many data mining algorithms, and therefore its optimization facilitates the whole data mining process. Finally, we benchmark and evaluate the performance of the optimized parallel prefix sum building block in CUDA.CUDA, threads, GPGPU, parallel prefix sum, parallel processing, task synchronization, warp

    Architecture-Aware Optimization on a 1600-core Graphics Processor

    Get PDF
    The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on the AMD GPU. Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU optimizations by applying each optimization in isolation as well as in concert to a large-scale, molecular modeling application called GEM. Via these AMD-specific GPU optimizations, the AMD Radeon HD 5870 GPU delivers 65% better performance than with the wellknown NVIDIA-specific optimizations

    The AXIOM software layers

    Get PDF
    AXIOM project aims at developing a heterogeneous computing board (SMP-FPGA).The Software Layers developed at the AXIOM project are explained.OmpSs provides an easy way to execute heterogeneous codes in multiple cores. People and objects will soon share the same digital network for information exchange in a world named as the age of the cyber-physical systems. The general expectation is that people and systems will interact in real-time. This poses pressure onto systems design to support increasing demands on computational power, while keeping a low power envelop. Additionally, modular scaling and easy programmability are also important to ensure these systems to become widespread. The whole set of expectations impose scientific and technological challenges that need to be properly addressed.The AXIOM project (Agile, eXtensible, fast I/O Module) will research new hardware/software architectures for cyber-physical systems to meet such expectations. The technical approach aims at solving fundamental problems to enable easy programmability of heterogeneous multi-core multi-board systems. AXIOM proposes the use of the task-based OmpSs programming model, leveraging low-level communication interfaces provided by the hardware. Modular scalability will be possible thanks to a fast interconnect embedded into each module. To this aim, an innovative ARM and FPGA-based board will be designed, with enhanced capabilities for interfacing with the physical world. Its effectiveness will be demonstrated with key scenarios such as Smart Video-Surveillance and Smart Living/Home (domotics).Peer ReviewedPostprint (author's final draft
    corecore