229 research outputs found

    Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads

    GPGPU architectures have become the dominant platform for massively parallel workloads, delivering high performance and energy efficiency for popular applications such as machine learning, computer vision or self-driving cars. However, irregular applications, such as graph processing, fail to fully exploit GPGPU resources because their divergent memory accesses saturate the memory hierarchy. To reduce the pressure on the memory subsystem for divergent memory-intensive applications, programmers must take into account the SIMT execution model and memory coalescing in GPGPUs, devoting significant effort to complex optimization techniques. Despite these efforts, we show that irregular graph processing still suffers from low GPGPU performance. We observe that in many irregular applications the mapping of data to threads can be safely changed. In other words, it is possible to relax the strict relationship between a thread and the data it processes in order to reduce memory divergence. Based on this observation, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension tightly integrated in the GPGPU pipeline. The IRU reorders the data processed by the threads on irregular accesses to improve memory coalescing, i.e., it tries to assign data elements to threads so as to produce coalesced accesses in SIMT groups. Furthermore, the IRU is capable of filtering and merging duplicated accesses, significantly reducing the workload. Programmers can easily utilize the IRU with a simple API, or let the compiler issue instructions from our extended ISA. We evaluate our proposal for state-of-the-art graph-based algorithms and a wide selection of applications. Results show that the IRU achieves a memory coalescing improvement of 1.32x and a 46% reduction in the overall traffic in the memory hierarchy, which results in a 1.33x speedup and 13% energy savings on average, while incurring a small 5.6% area overhead. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00 and the ICREA Academia program. Peer Reviewed. Postprint (published version).
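The effect of remapping data to threads on coalescing can be illustrated in software. The sketch below is not the IRU hardware or its API; it only counts how many distinct memory lines each SIMT group touches before and after the element indices are sorted and de-duplicated. The function names, the 32-thread group size, the 4-byte element size and the 128-byte line size are all assumptions chosen for illustration.

```python
def transactions(indices, warp_size=32, elem_bytes=4, line_bytes=128):
    """Sum, over all SIMT groups, the number of distinct memory lines touched.

    Each consecutive run of `warp_size` indices is treated as one group
    (assumed sizes; real hardware parameters vary).
    """
    total = 0
    for start in range(0, len(indices), warp_size):
        warp = indices[start:start + warp_size]
        lines = {(i * elem_bytes) // line_bytes for i in warp}
        total += len(lines)
    return total


def reorder_and_merge(indices):
    """Hypothetical software stand-in for reordering: drop duplicate
    indices and sort, so neighbouring threads get neighbouring elements."""
    return sorted(set(indices))


# An irregular access pattern: 64 threads reading scattered elements.
irregular = [(i * 37) % 256 for i in range(64)]

before = transactions(irregular)
after = transactions(reorder_and_merge(irregular))
print(f"memory lines touched: {before} before, {after} after reordering")
```

Sorting is only valid when, as the abstract notes, the mapping of data to threads can be changed safely; the sketch does not check that condition.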

    Doctor of Philosophy

    Dissertation. As the base of the software stack, system-level software is expected to provide efficient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction. All of these require substantial computing power. What's more, today's application workloads have entered gigabyte and terabyte scales, which demand even more computing power. To meet the rapidly increasing demand for computing power at the system level, this dissertation proposes using parallel graphics processing units (GPUs) in system software. GPUs excel at parallel computing, and their parallel performance is also improving much faster than that of central processing units (CPUs). However, system-level software was originally designed to be latency-oriented, whereas GPUs are designed for long-running computation and large-scale data processing, which are throughput-oriented. This mismatch makes it difficult to fit system-level software to GPUs. This dissertation presents generic principles of system-level GPU computing developed during the process of creating our two general frameworks for integrating GPU computing in storage and network packet processing. The principles are generic design techniques and abstractions to deal with common system-level GPU computing challenges. Those principles have been evaluated in concrete cases including storage and network packet processing applications that have been augmented with GPU computing. The significant performance improvement found in the evaluation shows the effectiveness and efficiency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, to introduce the state of the art in both applications and techniques, as well as their future potential.

    Compiler Techniques for Ensuring GPU Error Resilience

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. Lee Jaejin (이재진).
    Due to semiconductor technology scaling and near-threshold voltage computing, soft error resilience has become more important. Nowadays, GPUs are widely used in high performance computing (HPC) because of their efficient parallel processing, and modern GPUs designed for HPC use error correction code (ECC) to protect their storage, including register files. However, adopting ECC in the register file imposes high area and energy overhead. To replace the expensive hardware cost of ECC, we propose Penny, a lightweight compiler-directed resilience scheme for GPU register file protection. We combine recent advances in idempotent recovery with low-cost error detection code. Our approach focuses on solving two important problems: 1. Can we guarantee correct error recovery using idempotent execution with error detection code? We show that when an error detection code is used with idempotent recovery, certain restrictions required by previous idempotent recovery schemes are no longer needed. We also propose a software-based scheme to prevent the checkpoint value from being overwritten before the end of the region where the value is required for correct recovery. 2. How do we reduce the execution overhead caused by checkpointing? In GPUs, additional checkpointing store instructions inflict considerably higher overhead than in CPUs, due to architectural characteristics such as the lack of store buffers. We propose a number of compiler optimization techniques that significantly reduce the overhead.
    Contents:
    1 Introduction
    1.1 Why is Soft Error Resilience Important in GPUs
    1.2 How can the ECC Overhead be Reduced
    1.3 What are the Challenges
    1.4 How do We Solve the Challenges
    2 Comparison of Error Detection and Correction Coding Schemes for Register File Protection
    2.1 Error Correction Codes and Error Detection Codes
    2.2 Cost of Coding Schemes
    2.3 Soft Error Frequency of GPUs
    3 Idempotent Recovery and Challenges
    3.1 Idempotent Execution
    3.2 Previous Idempotent Schemes
    3.2.1 De Kruijf's Idempotent Translation
    3.2.2 Bolt's Idempotent Recovery
    3.2.3 Comparison between Idempotent Schemes
    3.3 Idempotent Recovery Process
    3.4 Idempotent Recovery Challenges for GPUs
    3.4.1 Checkpoint Overwriting
    3.4.2 Performance Overhead
    4 Correctness of Recovery
    4.1 Proof of Safe Recovery
    4.1.1 Prevention of Error Propagation
    4.1.2 Proof of Correct State Recovery
    4.1.3 Correctness in Multi-Threaded Execution
    4.2 Preventing Checkpoint Overwriting
    4.2.1 Register Renaming
    4.2.2 Storage Alternation by Checkpoint Coloring
    4.2.3 Automatic Algorithm Selection
    4.2.4 Future Works
    5 Performance Optimizations
    5.1 Compilation Phases of Penny
    5.1.1 Region Formation
    5.1.2 Bimodal Checkpoint Placement
    5.1.3 Storage Alternation
    5.1.4 Checkpoint Pruning
    5.1.5 Storage Assignment
    5.1.6 Code Generation and Low-level Optimizations
    5.2 Cost Estimation Model
    5.3 Region Formation
    5.3.1 De Kruijf's Heuristic Region Formation
    5.3.2 Region Splitting and Region Stitching
    5.3.3 Checkpoint-Cost Aware Optimal Region Formation
    5.4 Bimodal Checkpoint Placement
    5.5 Optimal Checkpoint Pruning
    5.5.1 Bolt's Naive Pruning Algorithm and Overview of Penny's Optimal Pruning Algorithm
    5.5.2 Phase 1: Collecting Global-Decision Independent Status
    5.5.3 Phase 2: Ordering and Finalizing Renaming Decisions
    5.5.4 Effectiveness of Eliminating the Checkpoints
    5.6 Automatic Checkpoint Storage Assignment
    5.7 Low-Level Optimizations and Code Generation
    6 Evaluation
    6.1 Test Environment
    6.1.1 GPU Architecture and Simulation Setup
    6.1.2 Tested Applications
    6.1.3 Register Assignment
    6.2 Performance Evaluation
    6.2.1 Overall Performance Overheads
    6.2.2 Impact of Penny's Optimizations
    6.2.3 Assigning Checkpoint Storage and Its Integrity
    6.2.4 Impact of Optimal Checkpoint Pruning
    6.2.5 Impact of Alias Analysis
    6.3 Repurposing the Saved ECC Area
    6.4 Energy Impact on Execution
    6.5 Performance Overhead on Volta Architecture
    6.6 Compilation Time
    7 Related Works
    8 Conclusion and Future Works
    8.1 Limitation and Future Work

    Symbolic Crosschecking of Data-Parallel Floating Point Code

    In this thesis we present a symbolic execution-based technique for cross-checking programs accelerated using SIMD or OpenCL against an unaccelerated version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a symbolic execution engine based on KLEE that supports symbolic reasoning on the equivalence between expressions involving both integer and floating-point operations. While the current generation of constraint solvers provide good support for integer arithmetic, there is little support available for floating-point arithmetic, due to the complexity inherent in such computations. The key insight behind our approach is that floating-point values are only reliably equal if they are essentially built by the same operations. This allows us to use an algorithm based on symbolic expression matching augmented with canonicalisation rules to determine path equivalence. Under symbolic execution, we have to verify equivalence along every feasible control-flow path. We reduce the branching factor of this process by aggressively merging conditionals, if-converting branches into select operations via an aggressive phi-node folding transformation. To support the Intel Streaming SIMD Extension (SSE) instruction set, we lower SSE instructions to equivalent generic vector operations, which in turn are interpreted in terms of primitive integer and floating-point operations. To support OpenCL programs, we symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. We detect data races by keeping track of all memory accesses using a memory log, and reporting a race whenever we detect that two accesses conflict. By representing the memory log symbolically, we are also able to detect races associated with symbolically indexed accesses of memory objects. 
We used KLEE-CL to find a number of issues in a variety of open source projects that use SSE and OpenCL, including mismatches between implementations, memory errors, race conditions and compiler bugs.
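The memory-log idea for race detection can be sketched with concrete addresses. This is a minimal illustration under stated assumptions, not KLEE-CL's implementation: the abstract describes a symbolic log, whereas here addresses are plain integers, barriers and atomics are ignored, and the `Access` record and `find_races` function are hypothetical names. Two accesses are treated as conflicting when they hit the same address from different work-items and at least one is a write.

```python
from collections import namedtuple

# One logged memory access (hypothetical record layout).
Access = namedtuple("Access", ["work_item", "address", "is_write"])


def find_races(log):
    """Return pairs of conflicting accesses found in a memory log.

    A pair conflicts if both touch the same address, come from different
    work-items, and at least one of them is a write.
    """
    races = []
    seen = {}  # address -> earlier accesses to that address
    for acc in log:
        for prev in seen.get(acc.address, []):
            different_items = prev.work_item != acc.work_item
            if different_items and (prev.is_write or acc.is_write):
                races.append((prev, acc))
        seen.setdefault(acc.address, []).append(acc)
    return races


log = [
    Access(work_item=0, address=100, is_write=True),
    Access(work_item=1, address=100, is_write=False),  # conflicts with the write
    Access(work_item=1, address=104, is_write=False),
    Access(work_item=0, address=104, is_write=False),  # read/read: no race
]
for first, second in find_races(log):
    print("race between", first, "and", second)
```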

    A Fine-grained Performance Model for GPU Architectures

    The increasing programmability, performance, and cost-effectiveness of GPUs have led to widespread use of such many-core architectures to accelerate general-purpose applications. Nevertheless, tuning applications to efficiently exploit the potential of the GPU is a very challenging task, especially for inexperienced programmers. This is due to the difficulty of developing a software application for the specific GPU architectural configuration, which includes managing the memory hierarchy and optimizing the execution of thousands of concurrent threads while maintaining the semantic correctness of the application. Even though several profiling tools exist, which provide programmers with a large number of metrics and measurements, it is often difficult to interpret such information to tune the application effectively. This paper presents a performance model that accurately estimates the potential performance of the application under tuning on a given GPU device and, at the same time, provides programmers with interpretable profiling hints. The paper shows the results obtained by applying the proposed model to profile commonly used primitives and real codes.

    Symbolic crosschecking of data-parallel floating-point code


    Working With Incremental Spatial Data During Parallel (GPU) Computation

    Central to many complex systems, spatial actors require an awareness of their local environment to enable behaviours such as communication and navigation. Complex system simulations represent this behaviour with Fixed Radius Near Neighbours (FRNN) search. This algorithm allows actors to store data at spatial locations and then query the data structure to find all data stored within a fixed radius of the search origin. The work within this thesis answers the question: What techniques can be used for improving the performance of FRNN searches during complex system simulations on Graphics Processing Units (GPUs)? It is generally agreed that Uniform Spatial Partitioning (USP) is the most suitable data structure for providing FRNN search on GPUs. However, due to the architectural complexities of GPUs, the performance is constrained such that FRNN search remains one of the most expensive common stages between complex systems models. Existing innovations to USP highlight a need to take advantage of recent GPU advances, reducing the levels of divergence and limiting redundant memory accesses as viable routes to improve the performance of FRNN search. This thesis addresses these with three separate optimisations that can be used simultaneously. Experiments have assessed the impact of optimisations to the general case of FRNN search found within complex system simulations and demonstrated their impact in practice when applied to full complex system models. Results presented show the performance of the construction and query stages of FRNN search can be improved by over 2x and 1.3x respectively. These improvements allow complex system simulations to be executed faster, enabling increases in scale and model complexity
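Fixed Radius Near Neighbours search over a uniform spatial partition can be sketched on the CPU as follows. This is a plain serial 2D illustration, not the thesis's GPU implementation or any of its three optimisations; the function names are hypothetical, and the cell width is assumed to equal the search radius so that a query only needs to visit the 3x3 block of cells around the origin.

```python
import math
from collections import defaultdict


def build_grid(points, radius):
    """Construction stage: bin each point index into a square cell of
    side `radius` (assumed cell size)."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cell = (math.floor(x / radius), math.floor(y / radius))
        grid[cell].append(idx)
    return grid


def frnn_query(points, grid, radius, origin):
    """Query stage: return sorted indices of points within `radius` of
    `origin`, scanning only the origin's cell and its eight neighbours."""
    ox, oy = origin
    cx, cy = math.floor(ox / radius), math.floor(oy / radius)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for idx in grid.get((cx + dx, cy + dy), []):
                px, py = points[idx]
                if (px - ox) ** 2 + (py - oy) ** 2 <= radius ** 2:
                    found.append(idx)
    return sorted(found)


points = [(0.5, 0.5), (1.4, 0.5), (3.0, 3.0), (0.5, 1.6)]
grid = build_grid(points, radius=1.0)
print(frnn_query(points, grid, 1.0, origin=(0.6, 0.6)))
```

The distance test inside the neighbouring cells is where threads in a GPU version would diverge and repeat memory reads, which is the cost the abstract refers to.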

    pocl: A Performance-Portable OpenCL Implementation

    OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand. Comment: This article was published in 2015; it is now openly accessible via arxi