    Characterization of Neural Network Backpropagation on Chiplet-based GPU Architectures

    Advances in parallel computing architectures (e.g., Graphics Processing Units (GPUs)) have had great success in helping meet the performance and energy-efficiency demands of many high-performance computing (HPC) applications. DRAM bandwidth is generally a critical performance bottleneck for many such applications. With advances in memory technology, the DRAM bandwidth bottleneck is shifting toward other parts of the system hierarchy (e.g., interconnects). We identify neural network backpropagation as one application where the interconnect network is one of the biggest performance bottlenecks. We show that the interconnect bottleneck for backpropagation can be significantly alleviated if computing cores and caching units are carefully tiled (an architecture commonly known as a "chiplet") and organized on the interconnect fabric. To simulate a chiplet design, we augment an existing, well-documented GPU simulator, GPGPU-Sim. Our modifications add an additional level of cache between the on-chip L1s and the interconnect network-on-chip. This additional layer of cache reduces demand on the interconnect by localizing memory traffic to individual chiplets. We show that, under a fixed core budget, the additional cache of a chiplet architecture can increase Instructions Per Cycle (IPC) for important CUDA kernels by up to 20% during the training phase.
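
    The mechanism the abstract describes, a shared cache per chiplet that absorbs L1 misses before they reach the interconnect, can be illustrated with a minimal standalone sketch. The code below is not from the paper or from GPGPU-Sim; all names (ChipletCache, the toy access trace, the 128-byte line size) are hypothetical, and the model only counts which requests would cross the network-on-chip when cores repeatedly access a shared region, the kind of reuse backpropagation exhibits within a layer.

```cpp
// Minimal sketch (hypothetical, not GPGPU-Sim code) of the paper's idea:
// a shared cache per chiplet sits between the cores' L1s and the NoC,
// so L1 misses that hit locally never cross the interconnect.
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

struct ChipletCache {
    std::unordered_set<std::uint64_t> lines;  // resident line tags
    std::uint64_t hits = 0, misses = 0;

    // Returns true on a hit; on a miss, installs the line so later
    // accesses from any core on this chiplet are served locally.
    // Capacity and eviction are deliberately not modeled in this sketch.
    bool access(std::uint64_t addr, std::uint64_t line_bytes = 128) {
        std::uint64_t tag = addr / line_bytes;
        if (lines.count(tag)) { ++hits; return true; }
        ++misses;
        lines.insert(tag);
        return false;
    }
};

int main() {
    const int kChiplets = 4;
    std::vector<ChipletCache> caches(kChiplets);
    std::uint64_t noc_requests = 0;  // requests crossing the interconnect

    // Toy trace: every chiplet re-reads the same 64 KiB weight region
    // three times, mimicking the intra-layer reuse of backpropagation.
    for (int pass = 0; pass < 3; ++pass)
        for (int c = 0; c < kChiplets; ++c)
            for (std::uint64_t addr = 0; addr < (1u << 16); addr += 128)
                if (!caches[c].access(addr))
                    ++noc_requests;  // only misses reach the NoC

    for (int c = 0; c < kChiplets; ++c)
        std::cout << "chiplet " << c << ": hits=" << caches[c].hits
                  << " misses=" << caches[c].misses << "\n";
    std::cout << "NoC requests: " << noc_requests << "\n";
}
```

    In this toy run only the first-pass misses (512 lines per chiplet) cross the interconnect, while the two repeat passes hit the per-chiplet cache. That traffic-localization effect is what the abstract credits for the IPC gains, although the paper's actual evaluation is performed inside the modified GPGPU-Sim rather than with a standalone model like this.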