Design and Analysis of Soft-Error Resilience Mechanisms for GPU Register File
Modern graphics processing units (GPUs) use an increasingly large register file (RF), which occupies a large fraction of the GPU core area and is accessed very frequently. This makes the RF vulnerable to soft errors (SEs). In this paper, we present two techniques for improving the SE resilience of the GPU RF. First, we propose compressing RF values to reduce the number of vulnerable bits. We leverage value similarity and the presence of narrow-width values to perform compression at warp or thread level, respectively. Second, we propose selective hardening, designing a portion of each register entry with SE-immune circuits. By using these techniques together, higher resilience can be provided at lower overhead. Without hardening, our warp- and thread-level compression techniques bring 47.0% and 40.8% reductions in SE vulnerability, respectively.
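The thread-level idea can be illustrated with a small sketch: if a 32-bit register value sign-extends from its low bits (a narrow-width value), only those low bits carry information and need protection, which shrinks the vulnerable-bit count. The function names and the 16-bit narrow threshold below are illustrative assumptions, not the paper's actual design.

```python
def is_narrow(value, width=16, word=32):
    """A word is 'narrow' if it sign-extends from its low `width` bits."""
    mask = (1 << width) - 1
    low = value & mask
    # Sign-extend the low bits and compare with the original word.
    sign = low - (1 << width) if low >> (width - 1) else low
    return (sign & ((1 << word) - 1)) == (value & ((1 << word) - 1))

def vulnerable_bits(values, width=16, word=32):
    """Bits needing protection: narrow values expose only `width` bits."""
    return sum(width if is_narrow(v, width, word) else word for v in values)

# Register values of one warp; small magnitudes are common in GPU code.
regs = [3, 0xFFFFFFFF, 70000, 12, 0x7FFFFFFF]
# vulnerable_bits(regs) is 112 instead of the uncompressed 5 * 32 = 160.
```

Here three of the five values are narrow (3, -1, and 12), so only 16 bits of each need protection, while 70000 and 0x7FFFFFFF keep their full 32 vulnerable bits.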
A GPU Register File using Static Data Compression
GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper introduces a new register file organization for efficient register packing of narrow integer and floating-point operands, designed to leverage advances in static analysis. We show that the hardware/software co-designed register file organization yields a performance improvement of up to 79%, and 18.6% on average, at a modest output-quality degradation.
Comment: Accepted to ICPP'2
A Survey of Techniques for Improving Security of GPUs
Graphics processing unit (GPU), although a powerful performance-booster, also
has many security vulnerabilities. Due to these, the GPU can act as a
safe-haven for stealthy malware and the weakest 'link' in the security 'chain'.
In this paper, we present a survey of techniques for analyzing and improving
GPU security. We classify the works on key attributes to highlight their
similarities and differences. More than informing users and researchers about
GPU security techniques, this survey aims to increase their awareness about GPU
security vulnerabilities and potential countermeasures.
Approximation and Compression Techniques to Enhance Performance of Graphics Processing Units
A key challenge in modern computing systems is to access data fast enough to fully utilize the computing elements in the chip. In graphics processing units (GPUs), performance is often constrained by register file size, memory bandwidth, and the capacity of the main memory. One important technique for alleviating this challenge is data compression. By reducing the amount of data that needs to be communicated or stored, memory resources crucial for performance can be used efficiently.

This thesis provides a set of approximation and compression techniques for GPUs, with the goal of efficiently utilizing the computational fabric and thereby increasing performance. The thesis shows that these techniques can substantially lower the amount of information the system has to process and are thus important tools for meeting challenges in memory utilization.

This thesis makes contributions within three areas: controlled floating-point precision reduction, lossless and lossy memory compression, and distributed training of neural networks. In the first area, the thesis shows that through automated and controlled floating-point approximation, the register file can be utilized more efficiently. This is achieved through a framework which establishes a cross-layer connection between the application and the microarchitecture layer, and a novel register file organization capable of leveraging low-precision floating-point values and narrow integers for increased capacity and performance.

Within the area of compression, this thesis aims at increasing the effective bandwidth of GPUs by presenting a lossless and lossy memory compression algorithm to reduce the amount of transferred data. In contrast to state-of-the-art compression techniques such as Base-Delta-Immediate and Bitplane Compression, which use intra-block bases for compression, the proposed algorithm leverages multiple global base values to reach a higher compression ratio.
The algorithm includes an optional approximation step for floating-point values which offers a higher compression ratio at a given, low, error rate.

Finally, within the area of distributed training of neural networks, this thesis proposes a subgraph approximation scheme for graph data which mitigates accuracy loss in a distributed setting. The scheme allows neural network models that use graphs as inputs to converge at single-machine accuracy, while minimizing synchronization overhead between the machines.
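The contrast between intra-block bases and global bases can be sketched as follows. The encoding below, one (base index, delta) pair per 32-bit word against a small shared set of bases, is a deliberate simplification; the index width, delta width, and fallback policy are assumptions, not the thesis's actual format.

```python
def compress_block(block, global_bases, delta_bits=8):
    """Encode each 32-bit word as (base_index, delta) against shared global
    bases; return None if any word's delta does not fit in `delta_bits`."""
    limit = 1 << delta_bits
    out = []
    for word in block:
        for i, base in enumerate(global_bases):
            delta = word - base
            if 0 <= delta < limit:
                out.append((i, delta))
                break
        else:
            return None  # word not coverable -> store block uncompressed
    return out

bases = [0x10000000, 0x00000000]                      # shared across ALL blocks
block = [0x10000004, 0x00000010, 0x10000080, 0x00000001]
# Each word shrinks from 32 bits to 1 index bit + 8 delta bits.
```

Because the bases are shared globally rather than stored per block (as in Base-Delta-Immediate), their cost is amortized over the whole memory, which is what enables the higher compression ratio the thesis reports.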
Hardware-Software Stack for an RC car for testing autonomous driving algorithms
In this paper, we report our ongoing work on developing hardware and software support for a toy RC car. This toy car can be used as a platform for evaluating algorithms and accelerators for autonomous driving vehicles (ADVs). We describe the different sensors and actuators used and their interfacing with two processors, viz., Jetson Nano and Raspberry Pi. Where possible, we have used ROS nodes for interfacing. We discuss the advantages and limitations of the different sensors and processors and issues related to their compatibility. We include both software (e.g., Python code, Linux commands) and hardware (e.g., pin configuration) information, which will be useful for reproducing the experiments. This paper will be useful for robotics enthusiasts and researchers in the area of autonomous driving.
Convolutional Neural Network Acceleration on GPU by Exploiting Data Reuse
Graphics processing units (GPUs) achieve high throughput with hundreds of cores for concurrent execution and a large register file for storing the context of thousands of threads. Deep learning algorithms have recently gained popularity for their capability to solve complex problems without programmer intervention. Deep learning algorithms operate on a massive amount of input data, which causes high memory access overhead. In the convolutional layer of a deep learning network, there exists a unique pattern of data access and reuse which is not effectively utilized by the GPU architecture. These abundant redundant memory accesses lead to significant power and performance overhead. In this thesis, I maintained redundant data in a faster on-chip memory, the register file, so that data used by multiple neurons can be fetched directly from the register file without cumbersome system memory accesses. In this method, a neuron's load instruction is replaced by a shuffle instruction if the data are found in the register file. To enable data sharing in the register file, a new register type was used as the destination register of load instructions. By using the unique ID of the new load destination registers, neurons can easily find their data in the register file. By exploiting the underutilized register file space, this method does not impose any area or power overhead on the register file design. The effectiveness of the new idea was evaluated through exhaustive experiments. According to the results, the new idea significantly improved performance and energy efficiency compared to the baseline architecture and a shared-memory solution.
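The reuse pattern this thesis exploits can be quantified with a small sketch: neighboring output neurons of a convolution read overlapping input windows, so every load after an element's first touch is redundant and is a candidate to be served by a register shuffle instead of a memory access. The input size and kernel size below are arbitrary illustrative assumptions.

```python
def redundant_load_fraction(width, height, k):
    """Fraction of loads in a k x k convolution (stride 1, no padding) that
    re-read an input element already loaded by an earlier output neuron."""
    out_w, out_h = width - k + 1, height - k + 1
    total_loads = out_w * out_h * k * k  # one load per window element
    touched = set()
    redundant = 0
    for oy in range(out_h):
        for ox in range(out_w):
            for ky in range(k):
                for kx in range(k):
                    pos = (ox + kx, oy + ky)
                    if pos in touched:
                        redundant += 1  # could be a shuffle, not a load
                    else:
                        touched.add(pos)
    return redundant / total_loads
```

For an 8x8 input and a 3x3 kernel, 36 windows issue 324 loads but touch only 64 distinct elements, so about 80% of the loads are redundant, which is the headroom a register-file sharing scheme can tap.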
DESTINY: A Comprehensive Tool with 3D and Multi-Level Cell Memory Modeling Capability
To enable the design of large-capacity memory structures, novel memory technologies such as non-volatile memory (NVM) and novel fabrication approaches, e.g., 3D stacking and multi-level cell (MLC) design, have been explored. The existing modeling tools, however, cover only a few memory technologies, technology nodes, and fabrication approaches. We present DESTINY, a tool for modeling 2D/3D memories designed using SRAM, resistive RAM (ReRAM), spin transfer torque RAM (STT-RAM), phase change RAM (PCM), and embedded DRAM (eDRAM), and 2D memories designed using spin orbit torque RAM (SOT-RAM), domain wall memory (DWM), and Flash memory. In addition to single-level cell (SLC) designs for all of these memories, DESTINY also supports modeling MLC designs for NVMs. We have extensively validated DESTINY against commercial and research prototypes of these memories. DESTINY is very useful for performing design-space exploration across several dimensions, such as optimizing for a target metric (e.g., latency, area, or energy-delay product) for a given memory technology, or choosing the most suitable memory technology or fabrication method (i.e., 2D vs. 3D) for a given optimization target. We believe that DESTINY will boost studies of next-generation memory architectures used in systems ranging from mobile devices to extreme-scale supercomputers. The latest source code of DESTINY is available from the following git repository: https://bitbucket.org/sparsh_mittal/destiny_v2
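The kind of design-space exploration DESTINY enables can be sketched as a loop over candidate (technology, fabrication) points that picks the one minimizing a target metric such as energy-delay product. The latency and energy numbers below are made-up placeholders standing in for DESTINY's model outputs, not real data.

```python
# Hypothetical (latency_ns, energy_pJ) per candidate design point; in practice
# these would come from DESTINY's validated 2D/3D memory models.
candidates = {
    ("SRAM", "2D"):    (1.2, 50.0),
    ("STT-RAM", "3D"): (2.5, 20.0),
    ("eDRAM", "3D"):   (1.8, 35.0),
}

def best_by_edp(cands):
    """Pick the design point with the lowest energy-delay product."""
    return min(cands, key=lambda c: cands[c][0] * cands[c][1])
```

With these placeholder numbers the 3D STT-RAM point wins on energy-delay product even though SRAM has the lowest latency, illustrating why the choice of optimization target can flip the preferred technology.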