
    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that to maximize system performance, the LLC must be managed with an awareness of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13% and by 6.5% on average. Rather than the LLC, it is the L1 data cache that has a pronounced impact on GPGPU performance, acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. First, ID-Cache performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Second, LATTE-CC exploits the GPU’s latency tolerance and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC achieve speedups of 71% and 19.2%, respectively, across a wide variety of GPGPU applications. Complementing these microarchitecture techniques, I identify the need for system-architecture innovations to sustain the performance scalability of GPGPUs in the face of a slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it achieves within 7% of the performance of a comparable but hypothetical monolithic-die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs and demonstrate that the inherent non-uniform memory access side-effects will form their key energy-efficiency bottleneck. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks of CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
    Doctoral Dissertation, Computer Science, 201
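    The ReMAP policy described above weighs a line's expected reuse against the cost of re-fetching it from main memory. The C++ sketch below illustrates that idea with a toy victim-selection routine; the structure fields, the reuse predictor, and the cost weights are illustrative assumptions, not the dissertation's implementation.

    #include <cstdint>
    #include <limits>
    #include <vector>

    // Toy model of ReMAP-style victim selection: prefer lines with no
    // predicted reuse and, among those, lines that are cheap to re-fetch
    // from main memory. All fields and weights are assumptions.
    struct CacheLine {
        bool     valid;
        bool     predictedReuse;  // reuse-predictor output (assumed)
        uint32_t refetchCost;     // estimated memory access cost in cycles (assumed)
    };

    int selectVictim(const std::vector<CacheLine>& set) {
        int victim = -1;
        uint64_t best = std::numeric_limits<uint64_t>::max();
        for (int way = 0; way < (int)set.size(); ++way) {
            if (!set[way].valid) return way;  // free way: nothing to evict
            // A large penalty keeps lines with predicted reuse resident
            // unless every line in the set is expected to be reused.
            uint64_t score = set[way].refetchCost
                           + (set[way].predictedReuse ? 1000000ull : 0ull);
            if (score < best) { best = score; victim = way; }
        }
        return victim;
    }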

    Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures

    abstract: With their massively multithreaded execution model, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPU computing). However, using GPUs to accelerate computation does not always yield good performance. This is mainly due to three inefficiencies in modern GPU and system architectures. First, not all parallel threads have a uniform amount of work with which to fully utilize the GPU’s compute capability, leading to a sub-optimal performance problem called warp criticality. To mitigate warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates the critical warp’s execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation shows that with CAWA, GPUs achieve an average speedup of 1.23x. Second, the shared cache storage in GPUs is often insufficient to accommodate the demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in the GPU’s cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction-aware, control-loop-based adaptive bypassing algorithm, called Ctrl-C. Ctrl-C learns the cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation shows that Ctrl-C effectively improves cache utilization in GPUs, achieving an average speedup of 1.42x for cache-sensitive GPGPU workloads. Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system can contend for memory resources at multiple levels, resulting in significant performance degradation. To maximize system throughput and balance the performance degradation across all co-located applications, I design a scalable performance degradation predictor specifically for heterogeneous systems, called HeteroPDP. HeteroPDP predicts application execution times and schedules OpenCL workloads to run on different devices according to the optimization goal. The evaluation shows that HeteroPDP improves system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gains an additional 50% speedup compared with always offloading the OpenCL workload to the GPU. In summary, this dissertation aims to provide insights for future microarchitecture and system-architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.
    Doctoral Dissertation, Computer Engineering, 201
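    Ctrl-C's use of a feedback control loop to tune cache bypassing can be pictured with a small proportional controller; the sampling window, target hit rate, gain, and bypass policy in this C++ sketch are illustrative assumptions rather than the dissertation's parameters.

    #include <cstdint>

    // Toy feedback controller in the spirit of Ctrl-C: periodically compare
    // the observed L1 hit rate against a target and nudge the fraction of
    // requests that bypass the cache. All constants are assumptions.
    class BypassController {
        double bypassFrac = 0.0;           // fraction of requests to bypass
        uint64_t hits = 0, accesses = 0;
    public:
        static constexpr uint64_t kWindow = 4096;  // sampling window (assumed)
        static constexpr double   kTarget = 0.5;   // target hit rate (assumed)
        static constexpr double   kGain   = 0.1;   // proportional gain (assumed)

        void recordAccess(bool hit) {
            accesses++; if (hit) hits++;
            if (accesses == kWindow) {             // end of control interval
                double hitRate = double(hits) / double(accesses);
                // Hit rate below target suggests thrashing: bypass more.
                bypassFrac += kGain * (kTarget - hitRate);
                if (bypassFrac < 0.0) bypassFrac = 0.0;
                if (bypassFrac > 1.0) bypassFrac = 1.0;
                hits = accesses = 0;
            }
        }
        // Simple stand-in policy: bypass requests whose address hash falls
        // inside the bypass fraction (the paper uses per-instruction state).
        bool shouldBypass(uint64_t addr) const {
            return double(addr % 1000) / 1000.0 < bypassFrac;
        }
    };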

    Exploiting intra-warp address monotonicity for fast memory coalescing in GPUs

    Graphics Processing Units (GPUs) are growing increasingly popular as general-purpose compute accelerators. GPUs are best suited for applications with abundant data parallelism, wherein the computation expressed as a single thread can be applied over a large set of data items. One key constraint that affects application performance on GPUs is that the underlying hardware is single-instruction, multiple-data (SIMD) hardware, which requires the parallel instructions from the multiple threads to execute in lock-step. The benefits of lock-step execution can be seriously degraded if the threads diverge (because of memory accesses or branches). Specifically, in the case of memory, the addresses from each thread in a SIMD wavefront/warp must be coalesced to enable parallel memory access and minimize divergence. The general coalescing problem assumes an arbitrary address distribution, which can be slow to handle. This thesis aims to exploit intra-warp address monotonicity (as measured in a recent study by Holic) to achieve fast memory coalescing. Holic's study reveals that intra-warp addresses are monotonically increasing or decreasing in the common case. The key contributions of this thesis are twofold. First, I design novel hardware coalescing mechanisms to achieve fast coalescing and quantify the area/delay of my coalescing designs. Second, I quantify the impact of fast coalescing on overall GPU performance for a suite of GPU benchmarks.
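    The payoff of monotonicity is that duplicate cache lines within a warp become adjacent, so a single linear pass can replace the general sort-or-match step. The C++ sketch below models that fast path in software; the warp width, line size, and interfaces are illustrative assumptions, not the thesis's hardware design.

    #include <cstdint>
    #include <vector>

    // Toy software model of monotonicity-based coalescing. Warp width and
    // cache line size are assumed values.
    constexpr int      kWarpSize  = 32;
    constexpr uint64_t kLineBytes = 128;

    bool isMonotonic(const uint64_t (&addr)[kWarpSize]) {
        bool asc = true, desc = true;
        for (int i = 1; i < kWarpSize; ++i) {
            asc  = asc  && addr[i] >= addr[i - 1];
            desc = desc && addr[i] <= addr[i - 1];
        }
        return asc || desc;
    }

    // Fast path: with monotonic addresses, all lanes touching the same
    // cache line are adjacent, so one linear scan with a duplicate check
    // against the previous line yields the coalesced request list; an
    // arbitrary address distribution would need sorting or all-pairs
    // matching instead.
    std::vector<uint64_t> coalesceMonotonic(const uint64_t (&addr)[kWarpSize]) {
        std::vector<uint64_t> lines;
        for (int i = 0; i < kWarpSize; ++i) {
            uint64_t line = (addr[i] / kLineBytes) * kLineBytes;
            if (lines.empty() || lines.back() != line) lines.push_back(line);
        }
        return lines;
    }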

    Simple out of order core for GPGPUs

    GPU architectures have become popular for executing general-purpose programs; they rely on having a large number of threads that run concurrently to hide the latency among dependent instructions. This approach has an important cost/overhead: low data locality, due to the increased pressure that the many concurrent threads place on the memory hierarchy, and the extra cost of storing and managing the on-chip state of those many threads. This paper presents SOCGPU (Simple Out-of-order Core for GPU), a simple out-of-order execution mechanism that requires neither register renaming nor scoreboards. It uses a small Instruction Buffer and a tiny Dependence Matrix to keep track of dependencies among instructions and avoid data hazards. Evaluations for an Nvidia Tesla V100-like GPU show that SOCGPU provides a speed-up of up to 2.3x in some machine learning programs and 1.38x on average for a variety of benchmarks, while reducing energy consumption by 6.5%, with only 2.4% area overhead.
    This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.
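    The Instruction Buffer plus Dependence Matrix combination can be modeled in a few lines of C++: bit (i, j) records that buffered instruction i waits on instruction j, an instruction may issue once its row is empty, and completion clears the matching column to wake dependents. The buffer size and interface here are illustrative assumptions, not the paper's exact design.

    #include <bitset>

    // Toy model of SOCGPU-style dependence tracking without renaming or
    // scoreboards. Buffer size is an assumed value.
    constexpr int kBufSize = 16;

    class DependenceMatrix {
        std::bitset<kBufSize> row[kBufSize];  // row[i][j]: slot i waits on slot j
        std::bitset<kBufSize> valid;          // occupied buffer slots
    public:
        void insert(int slot, std::bitset<kBufSize> producers) {
            row[slot] = producers & valid;    // track only in-flight producers
            valid.set(slot);
        }
        bool ready(int slot) const {          // no outstanding producers left
            return valid.test(slot) && row[slot].none();
        }
        void complete(int slot) {             // wake every dependent of `slot`
            for (int i = 0; i < kBufSize; ++i) row[i].reset(slot);
            valid.reset(slot);
        }
    };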

    Improving instruction scheduling in GPGPUs

    GPU architectures have become popular for executing general-purpose programs. Moreover, they are among the most efficient architectures for machine learning applications, which are among the most demanding workloads today. GPUs rely on having a large number of threads that run concurrently to hide the latency among dependent instructions. This work presents SOCGPU (Simple Out-of-order Core for GPU), a simple out-of-order execution mechanism that requires neither register renaming nor scoreboards. It uses a small Instruction Buffer and a tiny Dependence Matrix to keep track of dependencies among instructions and avoid data hazards. Evaluations for an Nvidia GTX1080TI-like GPU show that SOCGPU provides a speed-up of up to 3.76x in some machine learning programs and 1.58x on average for a variety of benchmarks, while reducing energy consumption by 17.6%, with only 3.48% area overhead when using the same number of warps as the baseline. Moreover, we show that SOCGPU can reduce the number of concurrently running warps while hardly affecting performance, which can provide significant reductions in area, especially in the register file and the instruction scheduler logic, as well as in other hardware structures of the GPU cores.
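    On top of such a dependence matrix, issue selection reduces to scanning the buffer in age order for a ready instruction. Reusing the hypothetical DependenceMatrix model from the sketch above, a minimal oldest-ready-first picker might look like this; the age-ordered scan is an assumed scheduling policy, not the paper's stated one.

    // Oldest-ready-first issue selection over the buffered instructions.
    // `ageOrder` lists occupied slots from oldest to youngest.
    int pickOldestReady(const DependenceMatrix& dm,
                        const int* ageOrder, int count) {
        for (int k = 0; k < count; ++k)
            if (dm.ready(ageOrder[k])) return ageOrder[k];
        return -1;  // nothing can issue this cycle
    }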

    CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs

    Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to their high compute throughput and memory bandwidth. Prior works presume that decompression is memory-bound, dedicating most of the GPU's threads to data movement and adopting complex software techniques to hide memory latency for reading compressed data and writing uncompressed data. This paper shows that these techniques lead to poor GPU resource utilization, as most threads end up waiting for the few decoding threads, exposing compute and synchronization latencies. Based on this observation, we propose CODAG, a novel and simple kernel architecture for high-throughput decompression on GPUs. CODAG eliminates the use of specialized groups of threads, frees up compute resources to increase the number of parallel decompression streams, and leverages the ample compute activities and the GPU's hardware scheduler to tolerate synchronization, compute, and memory latencies. Furthermore, CODAG provides a framework for users to easily incorporate new decompression algorithms without being burdened with implementing complex optimizations to hide memory latency. We validate our proposed architecture with three different encoding techniques, RLE v1, RLE v2, and Deflate, and a wide range of large datasets from different domains. We show that CODAG provides 13.46x, 5.69x, and 1.18x speedups for RLE v1, RLE v2, and Deflate, respectively, compared to the state-of-the-art decompressors from NVIDIA RAPIDS.
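    CODAG's central idea is that every worker runs the same self-contained decode loop over its own independently compressed chunk, with no specialized reader or writer thread groups. The C++ sketch below models that structure for a toy run-length encoding; the (run, value) byte layout and chunk table are illustrative assumptions, not the CODAG or RAPIDS formats, and the sequential loop stands in for a grid of identical GPU thread blocks.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One independently compressed chunk of the input (assumed layout:
    // a flat sequence of (run-length, value) byte pairs).
    struct Chunk { const uint8_t* data; size_t size; };

    // Self-contained decode loop: each worker reads, decodes, and writes
    // its own chunk, with no hand-off to specialized thread groups.
    std::vector<uint8_t> decodeRle(const Chunk& c) {
        std::vector<uint8_t> out;
        for (size_t i = 0; i + 1 < c.size; i += 2) {
            uint8_t run = c.data[i], value = c.data[i + 1];
            out.insert(out.end(), run, value);  // expand one run
        }
        return out;
    }

    // On a GPU each chunk would map to one thread block; this plain loop
    // stands in for the grid of identical, independent workers.
    std::vector<std::vector<uint8_t>> decodeAll(const std::vector<Chunk>& chunks) {
        std::vector<std::vector<uint8_t>> out(chunks.size());
        for (size_t b = 0; b < chunks.size(); ++b) out[b] = decodeRle(chunks[b]);
        return out;
    }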