2 research outputs found

    Unfavorable Strides in Cache Memory Systems (RNR Technical Report RNR-92-015)

    An important issue in obtaining high performance for a scientific application running on a cache-based computer system is the behavior of the cache when data are accessed at a constant stride. Others who have discussed this issue have noted an odd phenomenon in such situations: a few particular, innocent-looking strides result in sharply reduced cache efficiency. In this article, this problem is analyzed, and a simple formula is presented that accurately gives the cache efficiency for various cache parameters and data strides.
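
    The report's closed-form expression is not reproduced in this abstract, but the underlying effect can be sketched from standard cache geometry. The C program below is a minimal illustration, assuming a hypothetical direct-mapped cache with 512 sets of 64-byte lines (these parameters and the function names are assumptions, not taken from the report): a constant-stride stream maps element i to set (i * stride_in_lines) mod num_sets, so only num_sets / gcd(num_sets, stride_in_lines) distinct sets are ever used, and strides that share a large factor with the set count (typically powers of two) funnel every access into a handful of sets.

```c
#include <stdio.h>

/* Greatest common divisor (Euclid's algorithm). */
static unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/*
 * Number of distinct cache sets visited by a constant-stride access stream
 * on an assumed direct-mapped cache: addresses map to set
 * (addr / line_bytes) mod num_sets, so a stride of k lines visits
 * num_sets / gcd(num_sets, k) distinct sets.
 */
static unsigned sets_touched(unsigned num_sets, unsigned line_bytes,
                             unsigned stride_bytes) {
    unsigned stride_lines = stride_bytes / line_bytes;
    if (stride_lines == 0) stride_lines = 1;  /* sub-line strides walk every line */
    return num_sets / gcd(num_sets, stride_lines);
}

int main(void) {
    /* Assumed geometry: 32 KiB direct-mapped cache, 64-byte lines -> 512 sets. */
    const unsigned strides[] = { 64, 4096, 4160, 32768 };
    for (int i = 0; i < 4; i++)
        printf("stride %5u bytes -> %3u of 512 sets used\n",
               strides[i], sets_touched(512, 64, strides[i]));
    return 0;
}
```

    With these assumed parameters, a 4096-byte stride uses only 8 of the 512 sets, while lengthening it to 4160 bytes (one extra line, the effect of padding an array dimension) restores use of all 512 sets.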

    Improving GPU Shared Memory Access Efficiency

    Graphics Processing Units (GPUs) often employ shared memory to provide efficient storage for threads within a computational block. This shared memory is divided into multiple banks to improve performance by enabling concurrent accesses across the banks. Conflicts occur when multiple memory accesses attempt to access a particular bank simultaneously, resulting in serialized access and a concomitant reduction in performance. Identifying and eliminating these memory bank access conflicts is therefore critical for achieving high performance on GPUs; however, for common 1D and 2D access patterns, understanding the potential bank conflicts can prove difficult. Current GPUs support memory bank accesses with configurable bit-widths; optimizing these bit-widths can yield data layouts with fewer conflicts and better performance. This dissertation presents a framework for bank conflict analysis and automatic optimization. Given static access-pattern information for a kernel, the tool computes the conflict count of each pattern and then searches for an optimized solution across all shared memory buffers. The resulting data layout is described by parameters for inter-padding, intra-padding, and the bank access bit-width. The experimental results show that static bank conflict analysis is practical and independent of the workload size of a given access pattern. For 13 kernels from 6 benchmark suites (RODINIA and NVIDIA CUDA SDK) that exhibit shared memory bank conflicts, tests indicated that this approach can achieve a 5% to 35% improvement in runtime.
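
    The dissertation's analysis framework itself is not shown in this abstract, but the core idea of statically counting conflicts for a warp's access pattern can be sketched in plain C. The program below is an illustration under assumed hardware parameters (32 banks, 32-thread warps, a modeled bank access width of 4 or 8 bytes); conflict_degree and the offset patterns are hypothetical names, not the tool's API. It reports how many serialized transactions one bank would need for a given pattern and shows the classic case that padding parameters address: a column read of a 32x32 float tile lands every thread on the same bank, while padding each row by one element spreads the same read across all 32 banks.

```c
#include <stdio.h>

#define WARP_SIZE 32
#define NUM_BANKS 32

/*
 * Conflict degree of one warp-wide shared-memory access: the maximum number
 * of distinct bank rows any single bank must serve (accesses to the same
 * row are assumed to broadcast).  `bank_bytes` models a configurable bank
 * access width, e.g. 4 or 8 bytes.
 */
static int conflict_degree(const unsigned offsets[WARP_SIZE], unsigned bank_bytes) {
    unsigned rows[NUM_BANKS][WARP_SIZE];
    int counts[NUM_BANKS] = {0};
    int worst = 1;
    for (int t = 0; t < WARP_SIZE; t++) {
        unsigned word = offsets[t] / bank_bytes;
        unsigned bank = word % NUM_BANKS;
        unsigned row  = word / NUM_BANKS;
        int seen = 0;
        for (int k = 0; k < counts[bank]; k++)    /* same row already counted? */
            if (rows[bank][k] == row) { seen = 1; break; }
        if (!seen) {
            rows[bank][counts[bank]++] = row;
            if (counts[bank] > worst) worst = counts[bank];
        }
    }
    return worst;
}

int main(void) {
    /* Column access into a 32x32 float tile: thread t reads tile[t][c]. */
    unsigned col[WARP_SIZE], col_padded[WARP_SIZE];
    for (int t = 0; t < WARP_SIZE; t++) {
        col[t]        = (unsigned)t * 32 * 4;   /* row length 32 floats         */
        col_padded[t] = (unsigned)t * 33 * 4;   /* row length 33: +1 intra-pad  */
    }
    printf("unpadded column read: %d-way conflict\n", conflict_degree(col, 4));
    printf("padded   column read: %d-way conflict\n", conflict_degree(col_padded, 4));
    return 0;
}
```

    With a 4-byte bank width the unpadded column read serializes into 32 transactions on one bank, and the one-element intra-padding reduces it to a single conflict-free transaction; an automatic optimizer of the kind described above would search over such padding and bit-width choices for every shared buffer in a kernel.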