
    Power-aware caches for GPGPUs

    In this thesis, we propose two optimization techniques to reduce power consumption in the L1 caches (data, texture, and constant), shared memory, and the L2 cache of GPGPUs. The first technique targets static power. Evaluation of GPGPU applications shows that once a cache block is accessed by a thread, several hundred clock cycles typically pass before the same block is accessed again. This long inter-access interval can be exploited to put cache cells into drowsy mode and reduce static power. While drowsy cells reduce static power, they increase access time, because the voltage of a cell in drowsy mode must be raised before its block can be accessed. To mitigate this performance impact, we propose a novel technique called coarse-grained drowsy mode: we partition each cache into regions of consecutive cache blocks and wake up an entire region upon a cache access. Due to the temporal and spatial locality of cache accesses, this method dramatically reduces the performance penalty of drowsy cells. The second technique exploits branch divergence in GPGPUs. The GPGPU execution model is Single Instruction, Multiple Thread (SIMT), meaning processing cores execute the same instruction on different data across threads. The SIMT model may cause threads to diverge when a control-flow instruction is executed. GPGPUs execute branch instructions in two phases: in the first phase, threads on the taken path are active and the rest are idle; in the second phase, threads on the not-taken path execute and the rest are idle. Contemporary GPGPUs access all portions of a cache block even when some threads are idle due to branch divergence. We propose accessing only the portions of cache blocks that correspond to active threads. By disabling unnecessary sections of cache blocks, we reduce the dynamic power of caches. Our results show that, on average, the two techniques together reduce static and dynamic cache power by up to 98% and 15%, respectively.
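
    To make the coarse-grained drowsy idea concrete, the following is a minimal, simulator-style sketch of the region bookkeeping it implies: blocks are grouped into regions, an access wakes the whole region (paying a wake-up penalty only if the region was drowsy), and regions that stay idle fall back to drowsy mode. The class name, region size, timeout, and wake-up latency parameters are illustrative assumptions, not values from the thesis.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of coarse-grained drowsy-mode bookkeeping.
// Region size, drowsy timeout, and wake-up latency are invented parameters.
class DrowsyRegionTracker {
public:
    DrowsyRegionTracker(size_t numBlocks, size_t regionBlocks, uint64_t drowsyTimeout)
        : regionBlocks_(regionBlocks),
          drowsyTimeout_(drowsyTimeout),
          lastAccess_((numBlocks + regionBlocks - 1) / regionBlocks, 0),
          drowsy_((numBlocks + regionBlocks - 1) / regionBlocks, true) {}

    // Returns the extra cycles paid on this access (wake-up penalty if the
    // region was drowsy, 0 otherwise) and marks the whole region awake.
    uint64_t access(size_t blockIndex, uint64_t nowCycle, uint64_t wakeupCycles) {
        size_t region = blockIndex / regionBlocks_;
        uint64_t penalty = drowsy_[region] ? wakeupCycles : 0;
        drowsy_[region] = false;
        lastAccess_[region] = nowCycle;
        return penalty;
    }

    // Called periodically: regions idle longer than the timeout return to
    // drowsy (low-voltage) mode to save static power.
    void tick(uint64_t nowCycle) {
        for (size_t r = 0; r < drowsy_.size(); ++r)
            if (!drowsy_[r] && nowCycle - lastAccess_[r] >= drowsyTimeout_)
                drowsy_[r] = true;
    }

private:
    size_t regionBlocks_;
    uint64_t drowsyTimeout_;
    std::vector<uint64_t> lastAccess_;
    std::vector<bool> drowsy_;
};
```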

    A Survey of Techniques for Architecting TLBs

    A translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent TLB management is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.
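
    As background for the survey, here is a minimal sketch of the structure a TLB implements: a small, LRU-managed cache of virtual-page-number to physical-frame-number translations. The 4 KiB page size, 64-entry capacity, and fully associative LRU organization are illustrative assumptions; the TLB designs surveyed in the paper vary widely in these choices.

```cpp
#include <cstdint>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

// Toy fully associative TLB with LRU replacement (illustrative only).
class SimpleTLB {
public:
    explicit SimpleTLB(size_t capacity = 64) : capacity_(capacity) {}

    // Translate a virtual address; returns nothing on a TLB miss,
    // in which case the page table would have to be walked.
    std::optional<uint64_t> lookup(uint64_t vaddr) {
        uint64_t vpn = vaddr >> 12;                          // 4 KiB pages
        auto it = map_.find(vpn);
        if (it == map_.end()) return std::nullopt;           // TLB miss
        lru_.splice(lru_.begin(), lru_, it->second.second);  // move to MRU
        return (it->second.first << 12) | (vaddr & 0xFFF);
    }

    // Install a VPN -> PFN translation, evicting the LRU entry if full.
    void insert(uint64_t vpn, uint64_t pfn) {
        auto it = map_.find(vpn);
        if (it != map_.end()) {                              // refresh existing entry
            it->second.first = pfn;
            lru_.splice(lru_.begin(), lru_, it->second.second);
            return;
        }
        if (map_.size() >= capacity_) {                      // evict LRU entry
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(vpn);
        map_.emplace(vpn, std::make_pair(pfn, lru_.begin()));
    }

private:
    size_t capacity_;
    std::list<uint64_t> lru_;  // front = most recently used VPN
    std::unordered_map<uint64_t,
        std::pair<uint64_t, std::list<uint64_t>::iterator>> map_;
};
```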

    Characterization of Neural Network Backpropagation on Chiplet-based GPU Architectures

    Advances in parallel computing architectures, such as Graphics Processing Units (GPUs), have had great success in helping meet the performance and energy-efficiency demands of many high-performance computing (HPC) applications. DRAM bandwidth is generally a critical performance bottleneck for many such applications. With advances in memory technology, however, the DRAM bandwidth bottleneck is shifting toward other parts of the system hierarchy, such as the interconnect. We identify neural network backpropagation as one application where the interconnect network is one of the biggest performance bottlenecks. We show that the interconnect bottleneck for backpropagation can be significantly alleviated if computing cores and caching units are carefully tiled (an architecture commonly known as a "chiplet" design) and organized on the interconnect fabric. To simulate a chiplet design, we augment an existing, well-documented GPU simulator, GPGPU-Sim. Our modifications add an additional level of cache between the on-chip L1 caches and the interconnect network-on-chip. This additional cache level reduces demand on the interconnect by localizing memory traffic to individual chiplets. We show that, under a fixed core budget, a chiplet architecture with the additional cache can increase Instructions Per Cycle (IPC) for important CUDA kernels by up to 20% during the training phase.
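
    The following toy sketch (not actual GPGPU-Sim code) illustrates the effect of the added cache level: an L1 miss first probes a chiplet-local cache, and only misses there cross the network-on-chip, so local hits never generate interconnect traffic. The data structures and names are assumptions made for illustration.

```cpp
#include <cstdint>
#include <unordered_set>

// Chiplet-local cache modeled simply as a set of resident line addresses
// (no capacity or eviction modeled in this sketch).
struct ChipletCache {
    std::unordered_set<uint64_t> lines;
    bool contains(uint64_t lineAddr) const { return lines.count(lineAddr) != 0; }
    void fill(uint64_t lineAddr) { lines.insert(lineAddr); }
};

struct TrafficCounters {
    uint64_t chipletHits = 0;   // served locally, no NoC traversal
    uint64_t nocRequests = 0;   // forwarded over the interconnect
};

// Route an L1 miss: probe the chiplet-local cache first, fall back to the NoC.
void routeL1Miss(uint64_t lineAddr, ChipletCache& local, TrafficCounters& stats) {
    if (local.contains(lineAddr)) {
        ++stats.chipletHits;    // traffic stays on the chiplet
    } else {
        ++stats.nocRequests;    // must cross the interconnect
        local.fill(lineAddr);   // fill on return to capture future locality
    }
}
```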

    Reducing Cache Contention On GPUs

    The use of Graphics Processing Units (GPUs) as application accelerators has become increasingly popular because, compared to traditional CPUs, they are more cost-effective, their highly parallel nature complements a CPU, and they are more energy efficient. With this popularity, many GPU-based compute-intensive applications (a.k.a. GPGPU applications) show significant performance improvements over traditional CPU-based implementations. Caches, which significantly improve CPU performance, have been introduced to GPUs to further enhance application performance. However, the benefit of caches is insignificant in many GPU cases and even detrimental in some. The massive parallelism of the GPU execution model and the resulting memory accesses cause the GPU memory hierarchy to suffer from significant memory resource contention among threads. One cause of cache contention is the column-strided memory access pattern that many data-intensive GPU applications generate. When such access patterns are mapped to hardware thread groups, they become memory-divergent instructions whose memory requests are not GPU-hardware friendly, resulting in serialized accesses and performance degradation. Cache contention also arises from cache pollution caused by lines with low reuse: for a cache to be effective, a cached line must be reused before its eviction. Unfortunately, the streaming characteristics of GPGPU workloads and the massively parallel GPU execution model increase the reuse distance, or equivalently reduce the reuse frequency, of data. In a GPU, the pollution caused by data with large reuse distances is significant. Memory request stalls are another contention factor: a stalled Load/Store (LDST) unit does not execute memory requests from any ready warp in the issue stage, forfeiting potential cache hits for those warps. This dissertation proposes three novel architectural modifications to reduce this contention: 1) contention-aware selective caching detects the memory-divergent instructions caused by column-strided access patterns, computes the contending cache sets and locality information, and then caches selectively; 2) locality-aware selective caching dynamically estimates reuse frequency with efficient hardware and makes caching decisions based on that frequency; and 3) memory request scheduling queues memory requests from the warp issue stage, relieving LDST unit stalls, and schedules items from the queue to the LDST unit through multiple cache probes. Through systematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this dissertation demonstrates the effectiveness of these techniques and the viability of reducing cache contention through architectural support. Finally, it suggests other promising opportunities for future research on GPU architecture.
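
    As a rough illustration of the locality-aware selective caching idea, the sketch below uses a small table of saturating reuse counters to decide whether a fetched line should be inserted into the cache or bypass it. The table size, hash, counter width, and threshold are invented for illustration and are not the dissertation's actual mechanism.

```cpp
#include <array>
#include <cstdint>

// Small table of 2-bit saturating reuse counters, indexed by a hash of the
// line address, used to gate cache insertion (bypass lines with low reuse).
class ReuseFilter {
public:
    // Called on every access to the line; bumps its reuse estimate.
    void recordAccess(uint64_t lineAddr) {
        uint8_t& c = counters_[index(lineAddr)];
        if (c < 3) ++c;                              // saturate at 3
    }

    // On a miss, decide whether the incoming line is worth caching.
    bool shouldCache(uint64_t lineAddr) const {
        return counters_[index(lineAddr)] >= 2;      // cache only likely-reused data
    }

private:
    static size_t index(uint64_t lineAddr) { return (lineAddr >> 7) & 0x3FF; }
    std::array<uint8_t, 1024> counters_{};           // 1024 entries, all zeroed
};
```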

    Using hybrid shared and distributed caching for mixed-coherency GPU workloads

    Current GPU computing models support a mixture of coherent and incoherent classes of memory operations. Workloads using these models typically have working sets too large to fit in an economical SRAM structure. Still, GPU architectures have last-level caches primarily to fulfill two functions: eliminating redundant DRAM accesses when requests from different L1 caches target the same line, and maintaining on-chip memory coherence for the coherent class of memory operations. In this thesis, we propose an alternative memory system design for GPU architectures that is better suited to their workloads. Our design features a directory-like sharing tracker that allows the incoherent private L1 caches to directly satisfy remote requests for shared data. It also retains a shared L2 cache with a customized caching policy to support coherent accesses on-chip and to better serve non-coalesced requests that contend aggressively for cache lines. This thesis characterizes the novel tradeoffs among the components of the proposed memory system in terms of area, energy, and performance. We show that the proposed design achieves a 22% average reduction in DRAM data demand over a standard GPU architecture with a 1MB L2 cache, leading to an overall 28% average reduction in memory system energy consumption. Conversely, our results show that the DRAM data demand of the proposed design with a 256KB L2 cache is on par with that of a standard GPU architecture with a 1MB L2 cache, at smaller area overhead and leakage power. Our results, while motivated by the GPU realm, are not architecture-specific and can be extended to other throughput-oriented many-core organizations.
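
    A toy sketch of the directory-like sharing tracker idea follows: it records which private L1 last filled a line, so a later miss from another core can be forwarded to that L1 instead of going to the shared L2 or DRAM. The single-owner model (no multi-sharer bit vector) and all names are simplifying assumptions made here for brevity, not the thesis's actual design.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Directory-like sharing tracker: maps a line address to the L1 that holds it.
class SharingTracker {
public:
    // Record that L1 cache `l1Id` has fetched `lineAddr`.
    void recordFill(uint64_t lineAddr, uint32_t l1Id) { owner_[lineAddr] = l1Id; }

    // Forget the line once the owning L1 evicts it.
    void recordEviction(uint64_t lineAddr) { owner_.erase(lineAddr); }

    // On a miss in the requesting L1: if another L1 holds the line, return its
    // id so the request can be satisfied on-chip; otherwise fall through to
    // the shared L2 / DRAM path.
    std::optional<uint32_t> findRemoteOwner(uint64_t lineAddr, uint32_t requesterId) const {
        auto it = owner_.find(lineAddr);
        if (it == owner_.end() || it->second == requesterId) return std::nullopt;
        return it->second;
    }

private:
    std::unordered_map<uint64_t, uint32_t> owner_;
};
```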