Reliable and Energy Efficient MLC STT-RAM Buffer for CNN Accelerators
We propose a lightweight scheme in which the formation of a data block is changed so that it tolerates soft errors significantly better than the baseline. The key insight behind our work is that CNN weights are normalized between -1 and 1 after each convolutional layer, which leaves one bit unused in the half-precision floating-point representation. By taking advantage of this unused bit, we create a backup of the most significant bit to protect it against soft errors. Also, considering that in MLC STT-RAMs the cost of memory operations (read and write) and the reliability of a cell are content-dependent (some patterns require higher current and longer latency while being more susceptible to soft errors), we rearrange the data block to minimize the number of costly bit patterns. Combining these two techniques provides the same level of accuracy as an error-free baseline while improving read and write energy by 9% and 6%, respectively.
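Concretely, any half-precision value with magnitude at most 1 has a biased exponent of at most 0b01111, so the exponent's most significant bit (bit 14) is always zero. A minimal numpy sketch of the backup idea (function names are ours; recovery assumes the flip hit the original sign bit rather than its backup):

```python
import numpy as np

def write_protected(w):
    """Copy the sign bit (bit 15) into the exponent MSB (bit 14), which
    is guaranteed to be 0 for half-precision values in [-1, 1]."""
    bits = w.astype(np.float16).view(np.uint16)
    return bits | ((bits >> 15) << 14)

def read_protected(bits):
    """Rebuild the value, restoring the sign from its backup copy."""
    sign = (bits >> 14) & np.uint16(1)
    return ((bits & np.uint16(0x3FFF)) | (sign << 15)).view(np.float16)
```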
Implications of Noise in Resistive Memory on Deep Neural Networks for Image Classification
Resistive memory is a promising alternative to SRAM, but is also an inherently unstable device that requires substantial effort to ensure correct read and write operations. To avoid the associated costs in terms of area, time, and energy, the present work is concerned with exploring how much noise in memory operations can be tolerated by image classification tasks based on neural networks. We introduce a special noisy operator that mimics the noise in an exemplary resistive memory unit, explore the resilience of convolutional neural networks on the CIFAR-10 classification task, and discuss a couple of countermeasures to improve this resilience.
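As a toy stand-in for such a noisy operator (parameters and names here are illustrative, not the paper's), one can emulate an unreliable read by flipping each stored bit independently with a small probability:

```python
import numpy as np

def noisy_read(weights, p_flip=1e-3, rng=None):
    """Emulate an unreliable memory read: flip each bit of the stored
    float32 weights independently with probability p_flip."""
    if rng is None:
        rng = np.random.default_rng(0)
    bits = weights.astype(np.float32).view(np.uint32)
    for b in range(32):                          # visit every bit position
        flip = rng.random(bits.shape) < p_flip
        bits = np.where(flip, bits ^ np.uint32(1 << b), bits)
    return bits.view(np.float32)
```

Applying such an operator to each layer's weights before inference, then sweeping p_flip against CIFAR-10 accuracy, traces out the kind of resilience curve this study is after.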
OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory
Memristive devices promise an alternative approach toward non-von Neumann architectures, in which specific computational tasks are performed within the memory devices. In the machine learning (ML) domain, crossbar arrays of resistive devices have shown great promise for ML inference, as they allow for hardware acceleration of matrix multiplications. However, to enable widespread adoption of these novel architectures, it is critical to have an automatic compilation flow rather than relying on a manual mapping of specific kernels onto the crossbar arrays. We demonstrate the programmability of memristor-based accelerators using the new compiler design principle of multilevel rewriting, where a hierarchy of abstractions lowers programs level by level and performs code transformations at the most suitable abstraction. In particular, we develop a prototype compiler that progressively lowers a mathematical notation for tensor operations arising in ML workloads to fixed-function memristor-based hardware blocks.
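To make the lowering idea concrete, here is a sketch in plain Python rather than OCC's MLIR-style dialects (XBAR, crossbar_mvm, and the tiling are assumptions for illustration): one rewrite step turns a high-level matrix-vector product into the fixed-size crossbar operations the hardware actually executes.

```python
import numpy as np

XBAR = 128  # assumed crossbar array dimension

def crossbar_mvm(tile, vec):
    """Stand-in for one analog matrix-vector multiply on a crossbar."""
    return tile @ vec

def lower_matvec(A, x):
    """One lowering step: rewrite A @ x as a sum of XBAR-sized
    crossbar MVMs over tiles of A."""
    m, n = A.shape
    y = np.zeros(m)
    for i in range(0, m, XBAR):
        for j in range(0, n, XBAR):
            y[i:i+XBAR] += crossbar_mvm(A[i:i+XBAR, j:j+XBAR], x[j:j+XBAR])
    return y
```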
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation
Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures targeting parallel workloads such as Deep Learning (DL), because of its capability to achieve massive data parallelism at a low area overhead and to provide orders-of-magnitude data-movement savings by moving computational resources closer to the data. While many PIM architectures have been proposed, improvements are needed in communicating intermediate results to consumer kernels, in inter-tile communication at scale, in reduction operations, and in efficiently performing bit-serial operations with constants.
We present PIMSAB, a scalable architecture that provides a spatially-aware communication network for efficient intra-tile and inter-tile data movement, and efficient computation support for bit-serial compute patterns that are otherwise inefficient. Our architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs) and is codesigned with a compiler to achieve high utilization. The key novelties of our architecture are: (1) efficient support for spatially-aware communication, through a local H-tree network for reductions, explicit hardware for shuffling operands, and systolic broadcasting; and (2) exploiting the divisible nature of bit-serial computations through adaptive precision, bit-slicing, and efficient handling of constant operations.
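To see why constants suit bit-serial compute (a sketch of the general shift-and-add idea, not PIMSAB's actual microcode): each set bit of a constant costs one add cycle, so zero bits, and any precision shaved off adaptively, are cycles saved.

```python
def bitserial_mul_const(x, c):
    """Bit-serial multiply-by-constant via shift-and-add: one addition
    per set bit of c, so sparse or low-precision constants are cheap."""
    acc, shift = 0, 0
    while c:
        if c & 1:              # only set bits of the constant cost a cycle
            acc += x << shift
        c >>= 1
        shift += 1
    return acc

assert bitserial_mul_const(13, 10) == 130
```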
When compared against a similarly provisioned modern Tensor Core GPU (NVIDIA A100) across common DL kernels and an end-to-end DL network (ResNet18), PIMSAB outperforms the GPU by 3x and reduces energy by 4.2x. We also compare PIMSAB with a similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM) and observe speedups of 3.7x and 3.88x, respectively.
Comment: Aman Arora and Jian Weng are co-first authors with equal contribution.
Low-Cost Floating-Point Processing in ReRAM for Scientific Computing
We propose ReFloat, a principled approach to low-cost floating-point processing in ReRAM, in which a flexible, fine-grained floating-point representation stores exponent offsets relative to a shared base. The key motivation is that the number of exponent bits must be reduced, because computation latency and hardware cost grow exponentially with it, yet convergence still requires sufficiently accurate exponents. Our design reconciles these conflicting goals by storing each exponent as an offset from a base common to all values in a matrix block, the granularity of computation in ReRAM. Due to value locality, the exponents within a block differ only slightly, so the offsets need far fewer bits to represent them. In essence, ReFloat enables principled, local fine-tuning of the floating-point representation. Based on this idea, we define a flexible ReFloat format that specifies the matrix block size and the numbers of exponent and fraction bits. To determine the base for each block, we propose an optimization method that minimizes the difference between the exponents of the original matrix block and those of the converted block. We develop the conversion scheme from the default double-precision floating-point format to the ReFloat format, the computation procedure, and the low-cost floating-point processing architecture in ReRAM.
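A minimal numpy sketch of the conversion idea (taking the rounded mean exponent as the base is our simplification of the paper's optimization; e_bits and f_bits mirror the format parameters):

```python
import numpy as np

def to_refloat(block, e_bits=3, f_bits=8):
    """Encode a float64 block as (base, offsets, fractions): one shared
    exponent base per block plus short per-value exponent offsets."""
    m, exp = np.frexp(block)               # block = m * 2**exp, |m| in [0.5, 1)
    base = int(np.round(exp.mean()))       # simplified choice of common base
    lo, hi = -(1 << (e_bits - 1)), (1 << (e_bits - 1)) - 1
    offs = np.clip(exp - base, lo, hi)     # e_bits-wide exponent offsets
    scale = 1 << f_bits
    frac = np.round(m * scale) / scale     # fraction rounded to f_bits
    return base, offs.astype(np.int8), frac

def from_refloat(base, offs, frac):
    """Reconstruct the block; exact when no offset was clipped and the
    fraction fit into f_bits."""
    return np.ldexp(frac, offs.astype(np.int32) + base)
```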