834 research outputs found
Cooperative Caching for GPUs
The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad factors, such as cache thrashing and extensive multithreading. Such high L1 miss rates in turn place high demands on the shared L2 bandwidth. Extensive congestion in the L2 access path therefore results in high memory access latencies. In memory-intensive applications, these latencies get exposed due to a lack of active compute threads to mask such high latencies.
In this article, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory access latencies that lie in the critical path. We identify significant replication of data among private L1 caches, presenting an opportunity to reuse data among L1s. We further show how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate intercore communication of shared data. We show that this technique reduces traffic to the L2 cache by an average of 29%, freeing up the bandwidth for other accesses. We also show that the CCN reduces the average memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across L1 caches. In doing so, the CCN incurs a nominal area and energy overhead of 1.3% and 2.5%, respectively. Notably, the performance improvement with our proposed CCN compares favorably to the performance improvement achieved by simply doubling the number of L2 banks by up to 34%.</jats:p
Parallel Algorithm for Solving Kepler's Equation on Graphics Processing Units: Application to Analysis of Doppler Exoplanet Searches
[Abridged] We present the results of a highly parallel Kepler equation solver
using the Graphics Processing Unit (GPU) on a commercial nVidia GeForce 280GTX
and the "Compute Unified Device Architecture" programming environment. We apply
this to evaluate a goodness-of-fit statistic (e.g., chi^2) for Doppler
observations of stars potentially harboring multiple planetary companions
(assuming negligible planet-planet interactions). We tested multiple
implementations using single precision, double precision, pairs of single
precision, and mixed precision arithmetic. We find that the vast majority of
computations can be performed using single precision arithmetic, with selective
use of compensated summation for increased precision. However, standard single
precision is not adequate for calculating the mean anomaly from the time of
observation and orbital period when evaluating the goodness-of-fit for real
planetary systems and observational data sets. Using all double precision, our
GPU code outperforms a similar code using a modern CPU by a factor of over 60.
Using mixed-precision, our GPU code provides a speed-up factor of over 600,
when evaluating N_sys > 1024 models planetary systems each containing N_pl = 4
planets and assuming N_obs = 256 observations of each system. We conclude that
modern GPUs also offer a powerful tool for repeatedly evaluating Kepler's
equation and a goodness-of-fit statistic for orbital models when presented with
a large parameter space.Comment: 19 pages, to appear in New Astronom
Adaptive memory-side last-level GPU caching
Emerging GPU applications exhibit increasingly high computation demands which has led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) in equally-sized slices that are shared by all SMs. Although a shared LLC typically results in a lower miss rate, we find that for workloads with high degrees of data sharing across SMs, a private LLC leads to a significant performance advantage because of increased bandwidth to replicated cache lines across different LLC slices.
In this paper, we propose adaptive memory-side last-level GPU caching to boost performance for sharing-intensive workloads that need high bandwidth to read-only shared data. Adaptive caching leverages a lightweight performance model that balances increased LLC bandwidth against increased miss rate under private caching. In addition to improving performance for sharing-intensive workloads, adaptive caching also saves energy in a (co-designed) hierarchical two-stage crossbar NoC by power-gating and bypassing the second stage if the LLC is configured as a private cache. Our experimental results using 17 GPU workloads show that adaptive caching improves performance by 28.1% on average (up to 38.1%) compared to a shared LLC for sharing-intensive workloads. In addition, adaptive caching reduces NoC energy by 26.6% on average (up to 29.7%) and total system energy by 6.1% on average (up to 27.2%) when configured as a private cache. Finally, we demonstrate through a GPU NoC design space exploration that a hierarchical two-stage crossbar is both more power- and area-efficient than full and concentrated crossbars with the same bisection bandwidth, thus providing a low-cost cooperative solution to exploit workload sharing behavior in memory-side last-level caches
Technique and Instrument for Effective CPU and GPU Access Request Arbitration Using On-Chip Cache
This research paper presents a novel on-chip cache procedure and device that effectively handles access requests from both CPUs and GPUs. The proposed procedure involves classification caching based on the access request type, arbitrating different types of access requests for caching, and optimizing access time for CPU requests through cache while bypassing cache for GPU requests. The device includes CPU and GPU request queues, a moderator, and cache performance elements. By considering the distinct access characteristics of CPUs and GPUs simultaneously, this approach offers high performance, simple hardware implementation, and minimal cost
Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs
© 2023 Copyright held by the owner/author(s). This document is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
This document is the Accepted version of a Published Work that appeared in final form in 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), Viena, Austria, October 2023. To access the final edited and published work see https://doi.org/10.1109/PACT58117.2023.00019Literature is plentiful in works exploiting cache locality for GPUs. A majority of them explore replacement or bypassing policies. In this paper, however, we surpass this exploration by fabricating a formal proof for a no-overhead quasi-optimal caching technique for caching textures in graphics workloads. Textures make up a significant part of main memory traffic in mobile GPUs, which contributes to the total GPU energy consumption. Since texture accesses use a shared L2 cache, improving the L2 texture caching efficiency would decrease main memory traffic, thus improving energy efficiency, which is crucial for mobile GPUs. Our proposal reaches quasi-optimality by exploiting the frame-to-frame reuse of textures in graphics. We do this by traversing frames in a boustrophedonic1 manner w.r.t. the frame-to-frame tile order. We first approximate the texture access trace to a circular trace and then forge a formal proof for our proposal being optimal for such traces. We also complement the proof with empirical data that demonstrates the quasi-optimality of our no-cost proposal
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some applications paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica-
tions using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and o er some suggested areas for language development and integration
between coarse-grained and ne-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial di erential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scienti c applications developers
Persistent Kernels for Iterative Memory-bound GPU Applications
Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU
implementations have a loop on the host side that invokes the GPU kernel as
much as time/algorithm steps there are. The termination of each kernel
implicitly acts as the barrier required after advancing the solution every time
step. We propose a scheme for running memory-bound iterative GPU kernels:
PERsistent KernelS (PERKS). In this scheme the time loop is moved inside a
persistent kernel, and device-wide barriers are used for synchronization. We
then reduce the traffic to device memory by caching a subset of the output in
each time step in registers and shared memory to be used as input for the
following time step. PERKS can be generalized to any iterative solver: they are
largely independent of the solver's implementation. We explain the design
principle of PERKS and demonstrate the effectiveness of PERKS for a wide range
of iterative 2D/3D stencil benchmarks (geometric mean speedup of x in
small domains and x in large domains), and a Krylov subspace solver
(geometric mean speedup of x in smaller SpMV datasets from SuiteSparse
and x in larger SpMV datasets, for conjugate gradient)
- …