A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps
Modern Graphics Processing Units (GPUs) are well provisioned to support the
concurrent execution of thousands of threads. Unfortunately, different
bottlenecks during execution and heterogeneous application requirements create
imbalances in the utilization of resources within the cores. For example, when a GPU is
bottlenecked by the available off-chip memory bandwidth, its computational
resources are often overwhelmingly idle, waiting for data from memory to
arrive.
This work describes the Core-Assisted Bottleneck Acceleration (CABA)
framework that employs idle on-chip resources to alleviate different
bottlenecks in GPU execution. CABA provides flexible mechanisms to
automatically generate "assist warps" that execute on GPU cores to perform
specific tasks that can improve GPU performance and efficiency.
CABA enables the use of idle computational units and pipelines to alleviate
the memory bandwidth bottleneck, e.g., by using assist warps to perform data
compression to transfer less data from memory. Conversely, the same framework
can be employed to handle cases where the GPU is bottlenecked by the available
computational units, in which case the memory pipelines are idle and can be
used by CABA to speed up computation, e.g., by performing memoization using
assist warps.
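To make the compression use case concrete, the following is a minimal C sketch of a base-plus-delta line compressor of the kind an assist warp could run; the 4-byte-base/1-byte-delta layout, the 128-byte line size, and the function names are illustrative assumptions, not the paper's exact design.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_WORDS 32  /* one 128-byte cache line as 32 x 4-byte words */

/* Illustrative base+delta compression: store one 4-byte base and one
 * 1-byte signed delta per word, shrinking 128 bytes to 36. Returns
 * false when any delta does not fit, in which case the line would be
 * kept uncompressed. */
static bool compress_line(const uint32_t in[LINE_WORDS],
                          uint32_t *base, int8_t deltas[LINE_WORDS])
{
    *base = in[0];
    for (int i = 0; i < LINE_WORDS; i++) {
        int64_t d = (int64_t)in[i] - (int64_t)*base;
        if (d < INT8_MIN || d > INT8_MAX)
            return false;  /* delta overflow: fall back to raw line */
        deltas[i] = (int8_t)d;
    }
    return true;
}

/* Decompression is a short, data-parallel loop: on a GPU, each lane
 * of an assist warp would reconstruct one word. */
static void decompress_line(uint32_t base, const int8_t deltas[LINE_WORDS],
                            uint32_t out[LINE_WORDS])
{
    for (int i = 0; i < LINE_WORDS; i++)
        out[i] = base + (uint32_t)(int32_t)deltas[i];
}
```

Because decompression is a simple data-parallel loop, it maps naturally onto otherwise idle SIMD lanes, and every line that compresses successfully moves 36 bytes across the off-chip interface instead of 128.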
We provide a comprehensive design and evaluation of CABA to perform effective
and flexible data compression in the GPU memory hierarchy to alleviate the
memory bandwidth bottleneck. Our extensive evaluations show that CABA, when
used to implement data compression, provides an average performance improvement
of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU
applications.

Helper Without Threads: Customized Prefetching for Delinquent Irregular Loads
The growing memory footprints of cloud and big data applications mean that
data center CPUs can spend significant time waiting for memory. An attractive
approach to improving performance in such centralized compute settings is to
employ prefetchers that are customized per application, where gains can be
easily scaled across thousands of machines. Helper thread prefetching is such a
technique but has yet to achieve wide adoption since it requires spare thread
contexts or special hardware/firmware support. In this paper, we propose an
inline software prefetching technique that overcomes these restrictions by
inserting the helper code into the main thread itself. Our approach is
complementary to and does not interfere with existing hardware prefetchers
since we target only delinquent irregular load instructions (those with no
constant or striding address patterns). For each chosen load instruction, we
generate and insert a customized software prefetcher extracted from and
mimicking the application's dataflow, all without access to the application
source code. For a set of irregular workloads that are memory-bound, we
demonstrate up to 2X single-thread performance improvement on recent high-end
hardware (Intel Skylake) and up to 83% speedup over a helper thread
implementation on the same hardware, due to the absence of thread spawning
overhead.
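To illustrate the inline technique, here is a minimal C sketch for a hypothetical indexed-gather loop; the loop body, the prefetch distance of 16, and the use of the GCC/Clang __builtin_prefetch intrinsic are illustrative assumptions rather than the paper's generated code.

```c
#include <stddef.h>
#include <stdint.h>

#define PF_DIST 16  /* illustrative prefetch distance, tuned per workload */

/* Indexed gather: data[idx[i]] is a delinquent irregular load because
 * its address depends on another load and follows no constant stride. */
uint64_t sum_gather(const uint64_t *data, const uint32_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Inline helper code mimicking the loop's own address-generation
         * dataflow: compute the address needed PF_DIST iterations ahead
         * and issue a non-binding prefetch for it (read access, low
         * temporal locality). */
        if (i + PF_DIST < n)
            __builtin_prefetch(&data[idx[i + PF_DIST]], 0, 1);
        sum += data[idx[i]];
    }
    return sum;
}
```

The key point is that the inserted code reproduces the delinquent load's address-generation dataflow inline in the main thread, so no spare thread context or special hardware support is needed, and the prefetch is issued far enough ahead to overlap the miss latency with useful work.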