317 research outputs found

    Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines

    Full text link
    In this paper, we address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x with respect to single core CPU executions for morphological reconstruction and euclidean distance transform, respectively.Comment: 37 pages, 16 figure

    An efficient implementation of the Bellman-Ford algorithm for Kepler GPU architectures

    Get PDF
    Finding the shortest paths from a single source to all other vertices is a common problem in graph analysis. The Bellman-Ford's algorithm is the solution that solves such a single-source shortest path (SSSP) problem and better applies to be parallelized for many-core architectures. Nevertheless, the high degree of parallelism is guaranteed at the cost of low work efficiency, which, compared to similar algorithms in literature (e.g., Dijkstra's) involves much more redundant work and a consequent waste of power consumption. This article presents a parallel implementation of the Bellman-Ford algorithm that exploits the architectural characteristics of recent GPU architectures (i.e., NVIDIA Kepler, Maxwell) to improve both performance and work efficiency. The article presents different optimizations to the implementation, which are oriented both to the algorithm and to the architecture. The experimental results show that the proposed implementation provides an average speedup of 5x higher than the existing most efficient parallel implementations for SSSP, that it works on graphs where those implementations cannot work or are inefficient (e.g., graphs with negative weight edges, sparse graphs), and that it sensibly reduces the redundant work caused by the parallelization process

    A Safety-First Approach to Memory Models.

    Full text link
    Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics. Current concurrent languages support a relaxed memory model that requires programmers to explicitly annotate all memory accesses that can participate in a data-race ("unsafe" accesses). This requirement allows compiler and hardware to aggressively optimize unannotated accesses, which are assumed to be data-race-free ("safe" accesses), while still preserving SC semantics. However, unannotated data races are easy for programmers to accidentally introduce and are difficult to detect, and in such cases the safety and correctness of programs are significantly compromised. This dissertation argues instead for a safety-first approach, whereby every memory operation is treated as potentially unsafe by the compiler and hardware unless it is proven otherwise. The first solution, DRFx memory model, allows many common compiler and hardware optimizations (potentially SC-violating) on unsafe accesses and uses a runtime support to detect potential SC violations arising from reordering of unsafe accesses. On detecting a potential SC violation, execution is halted before the safety property is compromised. The second solution takes a different approach and preserves SC in both compiler and hardware. Both SC-preserving compiler and hardware are also built on the safety-first approach. All memory accesses are treated as potentially unsafe by the compiler and hardware. SC-preserving hardware relies on different static and dynamic techniques to identify safe accesses. Our results indicate that supporting SC at the language level is not expensive in terms of performance and hardware complexity. The dissertation also explores an extension of this safety-first approach for data-parallel accelerators such as Graphics Processing Units (GPUs). Significant microarchitectural differences between CPU and GPU require rethinking of efficient solutions for preserving SC in GPUs. The proposed solution based on our SC-preserving approach performs nearly on par with the baseline GPU that implements a data-race-free-0 memory model.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120794/1/ansingh_1.pd

    Optimizing the Performance of Directive-based Programming Model for GPGPUs

    Get PDF
    Accelerators have been deployed on most major HPC systems. They are considered to improve the performance of many applications. Accelerators such as GPUs have an immense potential in terms of high compute capacity but programming these devices is a challenge. OpenCL, CUDA and other vendor-specific models for accelerator programming definitely offer high performance, but these are low-level models that demand excellent programming skills; moreover, they are time consuming to write and debug. In order to simplify GPU programming, several directive-based programming models have been proposed, including HMPP, PGI accelerator model and OpenACC. OpenACC has now become established as the de facto standard. We evaluate and compare these models involving several scientific applications. To study the implementation challenges and the principles and techniques of directive- based models, we built an open source OpenACC compiler on top of a main stream compiler framework (OpenUH as a branch of Open64). In this dissertation, we present the required techniques to parallelize and optimize the applications ported with OpenACC programming model. We apply both user-level optimizations in the applications and compiler and runtime-driven optimizations. The compiler optimization focuses on the parallelization of reduction operations inside nested parallel loops. To fully utilize all GPU resources, we also extend the OpenACC model to support multiple GPUs in a single node. Our application porting experience also revealed the challenge of choosing good loop schedules. The default loop schedule chosen by the compiler may not produce the best performance, so the user has to manually try different loop schedules to improve the performance. To solve this issue, we developed a locality-aware auto-tuning framework which is based on the proposed memory access cost model to help the compiler choose optimal loop schedules and guide the user to choose appropriate loop schedules.Computer Science, Department o

    Graph Algorithms on GPUs

    Get PDF
    This chapter introduces the topic of graph algorithms on GPUs. It starts by presenting and comparing the main important data structures and techniques applied for representing and analysing graphs on GPUs at the state of the art.It then presents the theory and an updated review of the most efficient implementations of graph algorithms for GPUs. In particular, the chapter focuses on graph traversal algorithms (breadth-first search), single-source shortest path(Djikstra, Bellman-Ford, delta stepping, hybrids), and all-pair shortest path (Floyd-Warshall). By the end of the chapter, load balancing and memory access techniques are discussed through an overview of their main issues and management techniques

    Doctor of Philosophy

    Get PDF
    dissertationAs the base of the software stack, system-level software is expected to provide ecient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction. All of these require substantial computing power. What's more, today's application workloads have entered gigabyte and terabyte scales, which demand even more computing power. To solve the rapidly increased computing power demand at the system level, this dissertation proposes using parallel graphics pro- cessing units (GPUs) in system software. GPUs excel at parallel computing, and also have a much faster development trend in parallel performance than central processing units (CPUs). However, system-level software has been originally designed to be latency-oriented. GPUs are designed for long-running computation and large-scale data processing, which are throughput-oriented. Such mismatch makes it dicult to t the system-level software with the GPUs. This dissertation presents generic principles of system-level GPU computing developed during the process of creating our two general frameworks for integrating GPU computing in storage and network packet processing. The principles are generic design techniques and abstractions to deal with common system-level GPU computing challenges. Those principles have been evaluated in concrete cases including storage and network packet processing applications that have been augmented with GPU computing. The signicant performance improvement found in the evaluation shows the eectiveness and eciency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, to introduce the state of the art in both applications and techniques, and also their future potentials
    • …
    corecore