22 research outputs found

    Reducing Cache Contention On GPUs

    The usage of Graphics Processing Units (GPUs) as application accelerators has become increasingly popular because, compared to traditional CPUs, they are more cost-effective and energy-efficient, and their highly parallel nature complements the CPU. With the popularity of GPUs, many GPU-based compute-intensive applications (a.k.a. GPGPU applications) achieve significant performance improvements over traditional CPU-based implementations. Caches, which significantly improve CPU performance, have been introduced to GPUs to further enhance application performance. However, caches are often ineffective in GPUs and in some cases even detrimental. The massive parallelism of the GPU execution model and the resulting memory accesses cause the GPU memory hierarchy to suffer from significant memory resource contention among threads. One cause of cache contention is the column-strided memory access pattern that data-intensive GPU applications commonly generate. When such access patterns are mapped to hardware thread groups, they become memory-divergent instructions whose memory requests are not hardware-friendly, resulting in serialized accesses and performance degradation. Cache contention also arises from cache pollution caused by lines with low reuse. For a cache to be effective, a cached line must be reused before its eviction. Unfortunately, the streaming characteristic of GPGPU workloads and the massively parallel GPU execution model increase the reuse distance of data, or equivalently reduce its reuse frequency. In a GPU, the pollution caused by data with large reuse distances is significant. Memory request stalls are another contention factor: a stalled Load/Store (LDST) unit does not execute memory requests from any ready warps in the issue stage, denying those warps potential cache hits. This dissertation proposes three novel architectural modifications to reduce the contention: 1) contention-aware selective caching, which detects the memory-divergent instructions caused by column-strided access patterns, computes the contending cache sets and locality information, and then caches selectively; 2) locality-aware selective caching, which dynamically estimates reuse frequency with efficient hardware and caches based on that frequency; and 3) memory request scheduling, which queues memory requests from the warp issue stage, relieves LDST-unit stalls, and dispatches queued requests to the LDST unit through multiple cache probes. Through systematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this dissertation demonstrates the effectiveness of the proposed techniques and the viability of reducing cache contention through architectural support. Finally, this dissertation suggests other promising opportunities for future research on GPU architecture.
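
    A minimal CUDA sketch of the column-strided pattern described above (illustrative only, not code from the dissertation): when each thread is assigned one row of a row-major matrix, consecutive lanes of a warp read addresses a full row apart, so each warp-level load becomes a memory-divergent request that fans out to many cache lines and must be serialized by the LDST unit.

        #include <cuda_runtime.h>

        // Each thread sums one row of a row-major `rows x cols` matrix.
        // For a fixed `col`, lane i and lane i+1 of a warp read addresses that are
        // `cols` floats apart, so one warp-level load can touch up to 32 distinct
        // cache lines (memory-divergent) instead of a single coalesced line.
        __global__ void row_per_thread_sum(const float* a, float* out, int rows, int cols) {
            int row = blockIdx.x * blockDim.x + threadIdx.x;  // one matrix row per thread
            if (row >= rows) return;
            float sum = 0.0f;
            for (int col = 0; col < cols; ++col)
                sum += a[row * cols + col];  // stride between adjacent lanes = cols * sizeof(float)
            out[row] = sum;
        }
        // A coalesced alternative maps consecutive lanes to consecutive elements of the
        // same row (e.g., one warp per row with a lane-stride loop), so each warp-level
        // load falls into one or two cache lines.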

    GPU PERFORMANCE MODELLING AND OPTIMIZATION

    Ph.D. (NUS-TU/E Joint Ph.D.)

    Design And Analysis Of Memory Management Techniques For Next-Generation GPUs

    Graphics Processing Unit (GPU)-based architectures have become the default accelerator choice for a large number of data-parallel applications because they provide high compute throughput at a competitive power budget. Unlike CPUs, which typically have limited multi-threading capability, GPUs execute large numbers of threads concurrently to achieve high thread-level parallelism (TLP). Because each thread's computation requires its data to be loaded from or stored to memory, the key to sustaining the high TLP of GPUs lies in the high bandwidth provided by the GPU memory system. However, with the continuous scaling of GPUs, the challenges of designing an efficient GPU memory system have become twofold. On one hand, to keep the growing compute and memory resources highly utilized, co-locating two or more kernels in the GPU has become an inevitable trend. One of the major roadblocks to achieving the maximum benefits of multi-application execution is the difficulty of designing mechanisms that can efficiently and fairly manage application interference in the shared caches and the main memory. On the other hand, to maintain the continuous scaling of GPU performance, the increasing energy consumption of the memory system has become a major problem because of its limited power budget. This energy limitation restricts the memory system's maximum theoretical bandwidth and in turn limits overall throughput. To address these challenges, this dissertation proposes three different approaches. First, it shows that high efficiency and fairness can be achieved for GPU multi-programming with novel TLP management techniques. We propose a new metric, effective bandwidth (EB), to accurately estimate the usage of shared resources in the GPU memory hierarchy, along with a pattern-based searching scheme (PBS) that can quickly and accurately achieve efficiency or fairness by managing the TLP of each application. Second, to reduce data movement and improve GPU throughput, this dissertation develops the Address-Stride Assisted Approximate Value Predictor (ASAP) for GPUs. We show that by exploiting the correlation between address strides and value strides present in GPGPU applications, significant data-movement reduction and throughput improvement can be achieved with much lower application quality loss and hardware overhead. ASAP achieves this by predicting load values when it detects strides in their corresponding addresses. Third, this dissertation shows that GPU memory energy can be significantly reduced with novel memory scheduling techniques. We propose a lazy memory scheduler that significantly improves the row-buffer locality of GPU memory by leveraging the latency and error tolerance of GPGPU applications. Finally, our new work targets data-movement reduction with flexible data precisions. We present initial results to motivate novel data types and architectural support that dynamically reduce the data size transferred per memory operation. Altogether, this dissertation develops several innovative techniques to improve GPU memory system efficiency, which are necessary for enabling the development of next-generation GPUs.
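
    The address-stride/value-stride idea behind ASAP can be pictured with a small predictor table. The following host-side sketch is an assumption-laden illustration rather than the dissertation's actual hardware organization (the PC-indexed table, the field layout, and the confidence threshold are all assumptions): a load whose addresses and values have both advanced by constant strides gets its next value predicted instead of fetched from memory.

        #include <cstdint>
        #include <unordered_map>

        // Illustrative stride-based value predictor (assumed structure, not ASAP's
        // actual design). One entry per static load instruction, indexed by PC.
        struct PredictorEntry {
            uint64_t last_addr    = 0;
            int64_t  addr_stride  = 0;
            int64_t  last_value   = 0;
            int64_t  value_stride = 0;
            int      confidence   = 0;   // consecutive times both strides held
        };

        class StrideValuePredictor {
        public:
            // Returns true and a predicted value when the address follows the learned
            // stride and confidence is high; the memory access can then be skipped.
            bool predict(uint64_t pc, uint64_t addr, int64_t* predicted) const {
                auto it = table_.find(pc);
                if (it == table_.end()) return false;
                const PredictorEntry& e = it->second;
                if (e.confidence >= kThreshold && addr == e.last_addr + e.addr_stride) {
                    *predicted = e.last_value + e.value_stride;
                    return true;
                }
                return false;
            }

            // Trains the entry once the true value is known (also the verification path).
            void train(uint64_t pc, uint64_t addr, int64_t value) {
                PredictorEntry& e = table_[pc];
                bool strides_held = (addr == e.last_addr + e.addr_stride) &&
                                    (value == e.last_value + e.value_stride);
                e.confidence   = strides_held ? e.confidence + 1 : 0;
                e.addr_stride  = static_cast<int64_t>(addr - e.last_addr);
                e.value_stride = value - e.last_value;
                e.last_addr    = addr;
                e.last_value   = value;
            }

        private:
            static constexpr int kThreshold = 2;  // assumed confidence threshold
            std::unordered_map<uint64_t, PredictorEntry> table_;
        };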

    Cooperative Caching for GPUs


    Enabling efficient graph computing with near-data processing techniques

    With the emergence of data science, graph computing is becoming a crucial tool for processing big connected data. However, when mapped to modern computing systems, graph computing typically suffers from poor performance because of inefficiencies in memory subsystems. At the same time, emerging technologies such as the Hybrid Memory Cube (HMC) enable processing-in-memory (PIM) functionality, a promising form of near-data processing (NDP), by integrating compute units in the 3D-stacked logic layer. The PIM units allow operation offloading at the instruction level, which has considerable potential to overcome the performance bottleneck of graph computing. Nevertheless, studies have not fully explored this functionality for graph workloads or identified its applications and shortcomings. The main objective of this dissertation is to enable NDP techniques for efficient graph computing; specifically, it investigates PIM offloading at the instruction level. To achieve this goal, it presents a graph benchmark suite for understanding graph computing behaviors and then proposes architectural techniques for PIM offloading on various host platforms. This dissertation first presents GraphBIG, a comprehensive graph benchmark suite. To cover major graph computation types and data sources, GraphBIG selects representative data representations, workloads, and datasets from 21 real-world use cases across multiple application domains. This dissertation characterized the benchmarks on real machines and observed extremely irregular memory patterns and significantly diverse behaviors across computation types. GraphBIG helps users understand the behavior of modern graph computing on hardware architectures and enables future architecture and system research for graph computing. To achieve better graph-computing performance, this dissertation proposes GraphPIM, a full-stack NDP solution for graph computing. It performs an analysis of modern graph workloads to assess the applicability of PIM offloading and presents hardware and software mechanisms to efficiently make use of the PIM functionality. Following the real-world HMC 2.0 specification, GraphPIM provides performance benefits for graph applications without any user code modification or ISA changes. In addition, GraphPIM proposes an extension to PIM operations that can bring further performance benefits to more graph applications. The evaluation results show that GraphPIM achieves up to a 2.4X speedup with a 37% reduction in energy consumption. To effectively utilize NDP systems with GPU-based host architectures, which can consume hundreds of gigabytes per second of bandwidth, this dissertation explores managing the thermal constraints of 3D-stacked memory cubes. Based on real experiments with an HMC prototype, this study observes that the operating temperature of HMC is much higher than that of conventional DRAM and can even cause thermal shutdown with a passive cooling solution. It also shows that even with a commodity-server cooling solution, HMC can fail to keep the memory dies within the normal operating temperature range when in-memory processing is highly utilized, resulting in higher energy consumption and performance overhead. To this end, this dissertation proposes CoolPIM, a thermal-aware source throttling mechanism that controls the intensity of PIM offloading at runtime. The proposed technique keeps the memory dies of HMC within the normal operating temperature range using software-based techniques. The evaluation results show that CoolPIM achieves up to 1.4X and 1.37X speedups compared to the non-offloading and naïve offloading scenarios, respectively.
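
    CoolPIM's source-throttling idea can be illustrated with a simple controller. The host-side sketch below is a hypothetical illustration, not CoolPIM's actual mechanism: the temperature thresholds, the intensity-level scheme, and the class and method names are all assumptions. It captures only the stated behavior of reducing PIM offloading intensity at runtime when the memory dies run hot.

        #include <algorithm>

        // Hypothetical thermal-aware source throttle: a monitoring loop feeds in the
        // HMC die temperature, and each offload-eligible operation consults
        // should_offload() to choose between PIM offloading and host execution.
        class ThermalOffloadThrottle {
        public:
            explicit ThermalOffloadThrottle(int max_level)   // max_level >= 1
                : level_(max_level), max_level_(max_level) {}

            // Called periodically with the current memory-cube temperature (deg C).
            void update(double temp_c) {
                if (temp_c > kHotThreshold)
                    level_ = std::max(0, level_ - 1);            // too hot: offload fewer ops
                else if (temp_c < kSafeThreshold)
                    level_ = std::min(max_level_, level_ + 1);   // headroom: offload more ops
            }

            // Offloads roughly level_/max_level_ of the eligible operations; the rest
            // execute on the host as ordinary (non-PIM) instructions.
            bool should_offload() {
                counter_ = (counter_ + 1) % max_level_;
                return counter_ < level_;
            }

        private:
            static constexpr double kHotThreshold  = 85.0;  // assumed threshold, deg C
            static constexpr double kSafeThreshold = 75.0;  // assumed threshold, deg C
            int level_;
            const int max_level_;
            int counter_ = 0;
        };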