67 research outputs found
Non-Volatile Memory Array Based Quantization- and Noise-Resilient LSTM Neural Networks
In cloud and edge computing models, it is important that compute devices at
the edge be as power efficient as possible. Long short-term memory (LSTM)
neural networks have been widely used for natural language processing, time
series prediction and many other sequential data tasks. Thus, for these
applications there is increasing need for low-power accelerators for LSTM model
inference at the edge. In order to reduce power dissipation due to data
transfers within inference devices, there has been significant interest in
accelerating vector-matrix multiplication (VMM) operations using non-volatile
memory (NVM) weight arrays. In NVM array-based hardware, reduced bit-widths
also significantly increases the power efficiency. In this paper, we focus on
the application of quantization-aware training algorithm to LSTM models, and
the benefits these models bring in terms of resilience against both
quantization error and analog device noise. We have shown that only 4-bit NVM
weights and 4-bit ADC/DACs are needed to produce equivalent LSTM network
performance as floating-point baseline. Reasonable levels of ADC quantization
noise and weight noise can be naturally tolerated within our NVMbased quantized
LSTM network. Benchmark analysis of our proposed LSTM accelerator for inference
has shown at least 2.4x better computing efficiency and 40x higher area
efficiency than traditional digital approaches (GPU, FPGA, and ASIC). Some
other novel approaches based on NVM promise to deliver higher computing
efficiency (up to 4.7x) but require larger arrays with potential higher error
rates.Comment: Published in: 2019 IEEE International Conference on Rebooting
Computing (ICRC
TL-nvSRAM-CIM: Ultra-High-Density Three-Level ReRAM-Assisted Computing-in-nvSRAM with DC-Power Free Restore and Ternary MAC Operations
Accommodating all the weights on-chip for large-scale NNs remains a great
challenge for SRAM based computing-in-memory (SRAM-CIM) with limited on-chip
capacity. Previous non-volatile SRAM-CIM (nvSRAM-CIM) addresses this issue by
integrating high-density single-level ReRAMs on the top of high-efficiency
SRAM-CIM for weight storage to eliminate the off-chip memory access. However,
previous SL-nvSRAM-CIM suffers from poor scalability for an increased number of
SL-ReRAMs and limited computing efficiency. To overcome these challenges, this
work proposes an ultra-high-density three-level ReRAMs-assisted
computing-in-nonvolatile-SRAM (TL-nvSRAM-CIM) scheme for large NN models. The
clustered n-selector-n-ReRAM (cluster-nSnRs) is employed for reliable
weight-restore with eliminated DC power. Furthermore, a ternary SRAM-CIM
mechanism with differential computing scheme is proposed for energy-efficient
ternary MAC operations while preserving high NN accuracy. The proposed
TL-nvSRAM-CIM achieves 7.8x higher storage density, compared with the
state-of-art works. Moreover, TL-nvSRAM-CIM shows up to 2.9x and 1.9x enhanced
energy-efficiency, respectively, compared to the baseline designs of SRAM-CIM
and ReRAM-CIM, respectively
Enabling Neuromorphic Computing for Artificial Intelligence with Hardware-Software Co-Design
In the last decade, neuromorphic computing was rebirthed with the emergence of novel nano-devices and hardware-software co-design approaches. With the fast advancement in algorithms for today’s artificial intelligence (AI) applications, deep neural networks (DNNs) have become the mainstream technology. It has been a new research trend to enable neuromorphic designs for DNNs computing with high computing efficiency in speed and energy. In this chapter, we will summarize the recent advances in neuromorphic computing hardware and system designs with non-volatile resistive access memory (ReRAM) devices. More specifically, we will discuss the ReRAM-based neuromorphic computing hardware and system implementations, hardware-software co-design approaches for quantized and sparse DNNs, and architecture designs
Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators
Resistive random access memory (ReRAM)-based processing-in-memory (PIM)
architectures have demonstrated great potential to accelerate Deep Neural
Network (DNN) training/inference. However, the computational accuracy of analog
PIM is compromised due to the non-idealities, such as the conductance variation
of ReRAM cells. The impact of these non-idealities worsens as the number of
concurrently activated wordlines and bitlines increases. To guarantee
computational accuracy, only a limited number of wordlines and bitlines of the
crossbar array can be turned on concurrently, significantly reducing the
achievable parallelism of the architecture.
While the constraints on parallelism limit the efficiency of the
accelerators, they also provide a new opportunity for fine-grained
mixed-precision quantization. To enable efficient DNN inference on practical
ReRAM-based accelerators, we propose an algorithm-architecture co-design
framework called \underline{B}lock-\underline{W}ise mixed-precision
\underline{Q}uantization (BWQ). At the algorithm level, BWQ-A introduces a
mixed-precision quantization scheme at the block level, which achieves a high
weight and activation compression ratio with negligible accuracy degradation.
We also present the hardware architecture design BWQ-H, which leverages the
low-bit-width models achieved by BWQ-A to perform high-efficiency DNN inference
on ReRAM devices. BWQ-H also adopts a novel precision-aware weight mapping
method to increase the ReRAM crossbar's throughput. Our evaluation demonstrates
the effectiveness of BWQ, which achieves a 6.08x speedup and a 17.47x energy
saving on average compared to existing ReRAM-based architectures.Comment: 12 pages, 13 figure
ReDy: A Novel ReRAM-centric Dynamic Quantization Approach for Energy-efficient CNN Inference
The primary operation in DNNs is the dot product of quantized input
activations and weights. Prior works have proposed the design of memory-centric
architectures based on the Processing-In-Memory (PIM) paradigm. Resistive RAM
(ReRAM) technology is especially appealing for PIM-based DNN accelerators due
to its high density to store weights, low leakage energy, low read latency, and
high performance capabilities to perform the DNN dot-products massively in
parallel within the ReRAM crossbars. However, the main bottleneck of these
architectures is the energy-hungry analog-to-digital conversions (ADCs)
required to perform analog computations in-ReRAM, which penalizes the
efficiency and performance benefits of PIM. To improve energy-efficiency of
in-ReRAM analog dot-product computations we present ReDy, a hardware
accelerator that implements a ReRAM-centric Dynamic quantization scheme to take
advantage of the bit serial streaming and processing of activations. The energy
consumption of ReRAM-based DNN accelerators is directly proportional to the
numerical precision of the input activations of each DNN layer. In particular,
ReDy exploits that activations of CONV layers from Convolutional Neural
Networks (CNNs), a subset of DNNs, are commonly grouped according to the size
of their filters and the size of the ReRAM crossbars. Then, ReDy quantizes
on-the-fly each group of activations with a different numerical precision based
on a novel heuristic that takes into account the statistical distribution of
each group. Overall, ReDy greatly reduces the activity of the ReRAM crossbars
and the number of A/D conversions compared to an static 8-bit uniform
quantization. We evaluate ReDy on a popular set of modern CNNs. On average,
ReDy provides 13\% energy savings over an ISAAC-like accelerator with
negligible accuracy loss and area overhead.Comment: 13 pages, 16 figures, 4 Table
Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors
Basecalling, an essential step in many genome analysis studies, relies on
large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately,
these DNNs are computationally slow and inefficient, leading to considerable
delays and resource constraints in the sequence analysis process. A
Computation-In-Memory (CIM) architecture using memristors can significantly
accelerate the performance of DNNs. However, inherent device non-idealities and
architectural limitations of such designs can greatly degrade the basecalling
accuracy, which is critical for accurate genome analysis. To facilitate the
adoption of memristor-based CIM designs for basecalling, it is important to (1)
conduct a comprehensive analysis of potential CIM architectures and (2) develop
effective strategies for mitigating the possible adverse effects of inherent
device non-idealities and architectural limitations.
This paper proposes Swordfish, a novel hardware/software co-design framework
that can effectively address the two aforementioned issues. Swordfish
incorporates seven circuit and device restrictions or non-idealities from
characterized real memristor-based chips. Swordfish leverages various
hardware/software co-design solutions to mitigate the basecalling accuracy loss
due to such non-idealities. To demonstrate the effectiveness of Swordfish, we
take Bonito, the state-of-the-art (i.e., accurate and fast), open-source
basecaller as a case study. Our experimental results using Sword-fish show that
a CIM architecture can realistically accelerate Bonito for a wide range of real
datasets by an average of 25.7x, with an accuracy loss of 6.01%.Comment: To appear in 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 202
- …