Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels
Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels
concurrently. On these GPUs, the thread block scheduler (TBS) uses the FIFO
policy to schedule their thread blocks. We show that FIFO leaves performance to
chance, resulting in significant loss of performance and fairness. To improve
performance and fairness, we propose use of the preemptive Shortest Remaining
Time First (SRTF) policy instead. Although SRTF requires an estimate of runtime
of GPU kernels, we show that such an estimate of the runtime can be easily
obtained using online profiling and exploiting a simple observation on GPU
kernels' grid structure. Specifically, we propose a novel Structural Runtime
Predictor. Using a simple Staircase model of GPU kernel execution, we show that
the runtime of a kernel can be predicted by profiling only the first few thread
blocks. We evaluate an online predictor based on this model on benchmarks from
ERCBench, and find that it can estimate the actual runtime reasonably well
after the execution of only a single thread block. Next, we design a thread
block scheduler that is both concurrent kernel-aware and uses this predictor.
We implement the SRTF policy and evaluate it on two-program workloads from
ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. When compared
to MPMax, a state-of-the-art resource allocation policy for concurrent kernels,
SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also
propose SRTF/Adaptive which controls resource usage of concurrently executing
kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by
2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of
SRTF achieves system throughput to within 12.64% of Shortest Job First (SJF, an
oracle optimal scheduling policy), bridging 49% of the gap between FIFO and
SJF.
Comment: 14 pages, full pre-review version of PACT 2014 poster
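The abstract names a Staircase model but does not give its formula. As a minimal sketch, assume thread blocks run in waves of a fixed number of resident blocks and that every wave takes as long as the first profiled block; predicted runtime is then the wave count times that block time. The function names, parameters, and the wave assumption are illustrative and are not taken from the paper.

```python
import math


def staircase_runtime(remaining_blocks, resident_blocks, block_time):
    """Estimate remaining runtime under an assumed staircase model.

    Assumption (not from the paper): blocks execute in waves of
    `resident_blocks`, and each wave takes `block_time`, the time
    measured for one profiled thread block.
    """
    if remaining_blocks < 0 or resident_blocks <= 0 or block_time < 0:
        raise ValueError("invalid kernel parameters")
    waves = math.ceil(remaining_blocks / resident_blocks)
    return waves * block_time


def srtf_pick(kernels):
    """Return the kernel name with the shortest predicted remaining time.

    `kernels` maps a name to (remaining_blocks, resident_blocks, block_time).
    """
    return min(kernels, key=lambda name: staircase_runtime(*kernels[name]))


if __name__ == "__main__":
    # Hypothetical two-kernel workload.
    workload = {"a": (10, 4, 2.0), "b": (3, 4, 5.0)}
    print(srtf_pick(workload))
```

In this hypothetical workload, kernel "a" needs three waves of 2.0 time units and kernel "b" needs one wave of 5.0, so a shortest-remaining-time policy would run "b" first.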
Scratchpad Sharing in GPUs
GPGPU applications exploit on-chip scratchpad memory available in the
Graphics Processing Units (GPUs) to improve performance. The amount of thread
level parallelism present in the GPU is limited by the number of resident
threads, which in turn depends on the availability of scratchpad memory in its
streaming multiprocessor (SM). Since the scratchpad memory is allocated at
thread block granularity, part of the memory may remain unutilized. In this
paper, we propose architectural and compiler optimizations to improve the
scratchpad utilization. Our approach, Scratchpad Sharing, addresses scratchpad
under-utilization by launching additional thread blocks in each SM. These
thread blocks use unutilized scratchpad and also share scratchpad with other
resident blocks. To improve the performance of scratchpad sharing, we propose
Owner Warp First (OWF) scheduling that schedules warps from the additional
thread blocks effectively. The performance of this approach, however, is
limited by the availability of the shared part of scratchpad.
We propose compiler optimizations to improve the availability of shared
scratchpad. We describe a scratchpad allocation scheme that helps in allocating
scratchpad variables such that shared scratchpad is accessed for a short
duration. We introduce a new instruction, relssp, that, when executed, releases
the shared scratchpad. Finally, we describe an analysis for optimal placement
of relssp instructions such that shared scratchpad is released as early as
possible.
We implemented the hardware changes using the GPGPU-Sim simulator and
implemented the compiler optimizations in the Ocelot framework. We evaluated the
effectiveness of our approach on 19 kernels from 3 benchmark suites: CUDA-SDK,
GPGPU-Sim, and Rodinia. The kernels that underutilize scratchpad memory show an
average improvement of 19% and maximum improvement of 92.17% compared to the
baseline approach.
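The under-utilization described above follows from allocating scratchpad at thread block granularity. The arithmetic can be sketched as below; the scratchpad size and per-block demand are hypothetical numbers chosen for illustration, not values from the paper.

```python
def scratchpad_occupancy(scratchpad_bytes, bytes_per_block):
    """Return (resident_blocks, unused_bytes) for block-granular allocation.

    Only whole thread blocks can be placed, so any remainder of the
    scratchpad stays idle. Other residency limits (registers, thread
    count) are ignored in this sketch.
    """
    if scratchpad_bytes < 0 or bytes_per_block <= 0:
        raise ValueError("invalid sizes")
    resident_blocks = scratchpad_bytes // bytes_per_block
    unused_bytes = scratchpad_bytes - resident_blocks * bytes_per_block
    return resident_blocks, unused_bytes


if __name__ == "__main__":
    # Hypothetical SM with 16 KiB of scratchpad and blocks needing 6 KiB each.
    blocks, unused = scratchpad_occupancy(16 * 1024, 6 * 1024)
    print(blocks, unused)
```

With these assumed numbers, two blocks fit and 4 KiB sits idle, which is the kind of leftover space that launching an additional block with shared scratchpad aims to put to use.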
Primary vertex reconstruction using GPUs for the upgrade of the Inner Tracking System of the ALICE experiment at LHC
The abstract is in the attachment.
Fundamentals
Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, summarization, and clustering to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are examined with respect to their resource requirements and to how their scalability can be enhanced on diverse computing architectures ranging from embedded systems to large computing clusters.
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. The articles included in this book thus constitute an excellent reference for engineers and researchers with particular interests in any of these topics in parallel and distributed computing.
From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators
This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7).
The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7).
The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components.
This dissertation shows that by developing an understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
Autotuning wavefront patterns for heterogeneous architectures
Manual tuning of applications for heterogeneous parallel systems is tedious and complex.
Optimizations are often not portable, and the whole process must be repeated when moving
to a new system, or sometimes even to a different problem size.
Pattern based parallel programming models were originally designed to provide programmers
with an abstract layer, hiding tedious parallel boilerplate code, and allowing a focus on
only application specific issues. However, the constrained algorithmic model associated with
each pattern also enables the creation of pattern-specific optimization strategies. These can
capture more complex variations than would be accessible by analysis of equivalent unstructured
source code. These variations create complex optimization spaces. Machine learning
offers well established techniques for exploring such spaces.
In this thesis we use machine learning to create autotuning strategies for heterogeneous
parallel implementations of applications which follow the wavefront pattern. In a wavefront,
computation starts from one corner of the problem grid and proceeds diagonally like a wave
to the opposite corner in either two or three dimensions. Our framework partitions and
optimizes the work created by these applications across systems comprising multicore CPUs
and multiple GPU accelerators. The tuning opportunities for a wavefront include controlling
the amount of computation to be offloaded onto GPU accelerators, choosing the number of
CPU and GPU threads to process tasks, tiling for both CPU and GPU memory structures,
and trading redundant halo computation against communication for multiple GPUs.
Our exhaustive search of the problem space shows that these parameters are very sensitive
to the combination of architecture, wavefront instance and problem size. We design and
investigate a family of autotuning strategies, targeting single and multiple CPU + GPU
systems, and both two and three dimensional wavefront instances. These yield an average
of 87% of the performance found by offline exhaustive search, with up to 99% in some cases.
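The two-dimensional wavefront traversal described above can be sketched as an enumeration of anti-diagonals: all cells on one diagonal are independent of each other and could be processed in parallel, while successive diagonals must run in order. This is a generic illustration of the pattern, not the thesis framework; the function name is hypothetical.

```python
def wavefront_order(rows, cols):
    """Yield the cells of a rows x cols grid one anti-diagonal at a time.

    Diagonal d holds every cell (i, j) with i + j == d, so the sweep
    starts at corner (0, 0) and ends at the opposite corner.
    """
    for d in range(rows + cols - 1):
        first_row = max(0, d - cols + 1)
        last_row = min(rows, d + 1)
        yield [(i, d - i) for i in range(first_row, last_row)]


if __name__ == "__main__":
    for step, cells in enumerate(wavefront_order(2, 3)):
        print(step, cells)
```

The number of cells per diagonal grows and then shrinks, which is one reason the split of work between CPU and GPU is a tuning parameter: short diagonals near the corners expose little parallelism.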