42 research outputs found
Experiences with porting and modelling wavefront algorithms on many-core architectures
We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications and this report focuses on graphics processing units (GPUs) in particular. To this end, we have implemented NASA’s LU benchmark – a real world production-grade application – on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA).
This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC "Tesla” and "Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high perfor- mance computing (HPC) clusters based on a range of multi- core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC).
In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI). By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale
CampProf: A Visual Performance Analysis Tool for Memory Bound GPU Kernels
Current GPU tools and performance models provide some common architectural insights that guide the programmers to write optimal code. We challenge these performance models, by modeling and analyzing a lesser known, but very severe performance pitfall, called 'Partition Camping', in NVIDIA GPUs. Partition Camping is caused by memory accesses that are skewed towards a subset of the available memory partitions, which may degrade the performance of memory-bound CUDA kernels by up to seven-times. No existing tool can detect the partition camping effect in CUDA kernels.
We complement the existing tools by developing 'CampProf', a spreadsheet based, visual analysis tool, that detects the degree to which any memory-bound kernel suffers from partition camping. In addition, CampProf also predicts the kernel's performance at all execution configurations, if its performance parameters are known at any one of them. To demonstrate the utility of CampProf, we analyze three different applications using our tool, and demonstrate how it can be used to discover partition camping. We also demonstrate how CampProf can be used to monitor the performance improvements in the kernels, as the partition camping effect is being removed.
The performance model that drives CampProf was developed by applying multiple linear regression techniques over a set of specific micro-benchmarks that simulated the partition camping behavior. Our results show that the geometric mean of errors in our prediction model is within 12% of the actual execution times. In summary, CampProf is a new, accurate, and easy-to-use tool that can be used in conjunction with the existing tools to analyze and improve the overall performance of memory-bound CUDA kernels
Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels
Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels
concurrently. On these GPUs, the thread block scheduler (TBS) uses the FIFO
policy to schedule their thread blocks. We show that FIFO leaves performance to
chance, resulting in significant loss of performance and fairness. To improve
performance and fairness, we propose use of the preemptive Shortest Remaining
Time First (SRTF) policy instead. Although SRTF requires an estimate of runtime
of GPU kernels, we show that such an estimate of the runtime can be easily
obtained using online profiling and exploiting a simple observation on GPU
kernels' grid structure. Specifically, we propose a novel Structural Runtime
Predictor. Using a simple Staircase model of GPU kernel execution, we show that
the runtime of a kernel can be predicted by profiling only the first few thread
blocks. We evaluate an online predictor based on this model on benchmarks from
ERCBench, and find that it can estimate the actual runtime reasonably well
after the execution of only a single thread block. Next, we design a thread
block scheduler that is both concurrent kernel-aware and uses this predictor.
We implement the SRTF policy and evaluate it on two-program workloads from
ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. When compared
to MPMax, a state-of-the-art resource allocation policy for concurrent kernels,
SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also
propose SRTF/Adaptive which controls resource usage of concurrently executing
kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by
2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of
SRTF achieves system throughput to within 12.64% of Shortest Job First (SJF, an
oracle optimal scheduling policy), bridging 49% of the gap between FIFO and
SJF.Comment: 14 pages, full pre-review version of PACT 2014 poste
Testing the Divergence Stack Memory on GPGPUs: A Modular in-Field Test Strategy
General Purpose Graphic Processing Units (GPGPUs) are becoming a promising solution in safety-critical applications, e.g., in the automotive domain. In these applications, reliability and functional safety are relevant factors in the selection of devices to build the systems. Nowadays, many challenges are impacting the implementation of high-performance devices, such as GPGPUs. Moreover, there is the need for effective fault detection solutions to guarantee the correct in-field operation of a GPGPU, such as in the branch management unit, which is one of the most critical modules in this parallel architecture. Faults affecting this structure can heavily corrupt or even collapse the execution of an application on the GPGPU. In this work, we propose a non-invasive Software-Based Self-Test (SBST) solution to detect faults affecting the memory in the branch management unit of a GPGPU. We propose a scalar and modular mechanism to develop the test program as a combination of software functions. The FlexGripPlus model was employed to evaluate the proposed strategies experimentally. Results show that the proposed strategies are effective to test the target structure and detect up to 98% of permanent faults. General Purpose Graphic Processing Units (GPGPUs) are becoming a promising solution in safety-critical applications, e.g., in the automotive domain. In these applications, reliability and functional safety are relevant factors in the selection of devices to build the systems. Nowadays, many challenges are impacting the implementation of high-performance devices, such as GPGPUs. Moreover, there is the need for effective fault detection solutions to guarantee the correct in-field operation of a GPGPU, such as in the branch management unit, which is one of the most critical modules in this parallel architecture. Faults affecting this structure can heavily corrupt or even collapse the execution of an application on the GPGPU. In this work, we propose a non-invasive Software-Based Self-Test (SBST) solution to detect faults affecting the memory in the branch management unit of a GPGPU. We propose a scalar and modular mechanism to develop the test program as a combination of software functions. The FlexGripPlus model was employed to evaluate the proposed strategies experimentally. Results show that the proposed strategies are effective to test the target structure and detect up to 98% of permanent faults
ARGO: An Auto-Tuning Runtime System for Scalable GNN Training on Multi-Core Processor
As Graph Neural Networks (GNNs) become popular, libraries like
PyTorch-Geometric (PyG) and Deep Graph Library (DGL) are proposed; these
libraries have emerged as the de facto standard for implementing GNNs because
they provide graph-oriented APIs and are purposefully designed to manage the
inherent sparsity and irregularity in graph structures. However, these
libraries show poor scalability on multi-core processors, which under-utilizes
the available platform resources and limits the performance. This is because
GNN training is a resource-intensive workload with high volume of irregular
data accessing, and existing libraries fail to utilize the memory bandwidth
efficiently. To address this challenge, we propose ARGO, a novel runtime system
for GNN training that offers scalable performance. ARGO exploits
multi-processing and core-binding techniques to improve platform resource
utilization. We further develop an auto-tuner that searches for the optimal
configuration for multi-processing and core-binding. The auto-tuner works
automatically, making it completely transparent from the user. Furthermore, the
auto-tuner allows ARGO to adapt to various platforms, GNN models, datasets,
etc. We evaluate ARGO on two representative GNN models and four widely-used
datasets on two platforms. With the proposed autotuner, ARGO is able to select
a near-optimal configuration by exploring only 5% of the design space. ARGO
speeds up state-of-the-art GNN libraries by up to 5.06x and 4.54x on a
four-socket Ice Lake machine with 112 cores and a two-socket Sapphire Rapids
machine with 64 cores, respectively. Finally, ARGO can seamlessly integrate
into widely-used GNN libraries (e.g., DGL, PyG) with few lines of code and
speed up GNN training.Comment: To appear in IEEE International Parallel and Distributed Processing
Symposium (IPDPS) 202
Modular Functional Testing: Targeting the Small Embedded Memories in GPUs
Graphic Processing Units (GPUs) are promising solutions in safety-critical applications, e.g., in the automotive domain. In these applications, reliability and functional safety are relevant factors. Nowadays, many challenges are impacting the implementation of high-performance devices, including GPUs. Moreover, there is a need for effective fault detection solutions to guarantee the correct in-field operation. This work describes a modular approach to developing functional testing solutions based on the non-invasive Software-Based Self-Test (SBST) strategy. We propose a scalar and modular mechanism to develop test programs based on schematic organizations of functions allowing the exploration of different solutions using software functions. The FlexGripPlus model was employed to evaluate experimentally the proposed strategies, targeting the embedded memories in the GPU. Results show that the proposed strategies are effective to test the target structures and detect from 98% up to 100% of permanent stuck-at faults