FAT: An In-Memory Accelerator with Fast Addition for Ternary Weight Neural Networks
Convolutional Neural Networks (CNNs) demonstrate excellent performance in
various applications but have high computational complexity. Quantization is
applied to reduce the latency and storage cost of CNNs. Among the quantization
methods, Binary and Ternary Weight Networks (BWNs and TWNs) have a unique
advantage over 8-bit and 4-bit quantization. They replace the multiplication
operations in CNNs with additions, which are favoured on In-Memory-Computing
(IMC) devices. IMC acceleration for BWNs has been widely studied. However,
although TWNs have higher accuracy and better sparsity than BWNs, IMC
acceleration for TWNs has received little study. TWNs on existing IMC devices are
inefficient because the sparsity is not well utilized, and the addition
operation is not efficient.
In this paper, we propose FAT as a novel IMC accelerator for TWNs. First, we
propose a Sparse Addition Control Unit, which utilizes the sparsity of TWNs to
skip the null operations on zero weights. Second, we propose a fast addition
scheme based on the memory Sense Amplifier to avoid the time overhead of both
carry propagation and writing back the carry to memory cells. Third, we further
propose a Combined-Stationary data mapping to reduce the data movement of
activations and weights and increase the parallelism across memory columns.
Simulation results show that for addition operations at the Sense Amplifier
level, FAT achieves 2.00X speedup, 1.22X power efficiency, and 1.22X area
efficiency compared with a State-Of-The-Art IMC accelerator ParaPIM. FAT
achieves 10.02X speedup and 12.19X energy efficiency compared with ParaPIM on
networks with 80% average sparsity.
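The two ideas the abstract relies on, replacing multiplications with additions and skipping zero weights, can be illustrated in software. The following is a minimal sketch, not the FAT hardware design: the function name `ternary_dot` and the skip counter are assumptions made for illustration, and nothing here models the Sense Amplifier addition scheme or the data mapping.

```python
def ternary_dot(activations, weights):
    """Dot product with weights restricted to {-1, 0, +1}.

    Uses only additions and subtractions, and skips zero weights
    entirely (the "null operations" a sparse control unit would avoid).
    Returns (result, number_of_skipped_operations).
    """
    acc = 0
    skipped = 0
    for a, w in zip(activations, weights):
        if w == 0:
            skipped += 1      # zero weight: no arithmetic needed
        elif w == 1:
            acc += a          # +1 weight: add the activation
        elif w == -1:
            acc -= a          # -1 weight: subtract the activation
        else:
            raise ValueError("weights must be -1, 0, or +1")
    return acc, skipped


# 3 - 2 + 1 = 2, with two zero weights skipped
print(ternary_dot([3, 5, 2, 7, 1], [1, 0, -1, 0, 1]))  # → (2, 2)
```

With 80% of the weights equal to zero, four out of five loop iterations in this sketch would do no arithmetic at all, which is the kind of saving sparsity-aware control is meant to exploit.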
System and Design Technology Co-optimization of SOT-MRAM for High-Performance AI Accelerator Memory System
SoCs are now designed with their own AI accelerator segment to accommodate
the ever-increasing demand of Deep Learning (DL) applications. With powerful
MAC engines for matrix multiplications, these accelerators show high computing
performance. However, because of limited memory resources (i.e., bandwidth and
capacity), they fail to achieve optimum system performance during large batch
training and inference. In this work, we propose a memory system with high
on-chip capacity and bandwidth to shift the gear of AI accelerators from
memory-bound to achieving system-level peak performance. We develop the memory
system with DTCO-enabled customized SOT-MRAM as large on-chip memory through
STCO and detailed characterization of the DL workloads. We evaluate our
workload-aware memory system on the CV and NLP benchmarks and observe
significant PPA improvement compared to an SRAM-based system in both inference
and training modes. Our workload-aware memory system achieves 8X energy and 9X
latency improvement on Computer Vision (CV) benchmarks in training and 8X
energy and 4.5X latency improvement on Natural Language Processing (NLP)
benchmarks in training while consuming only around 50% of SRAM area at
iso-capacity.
Design Space Exploration and Comparative Evaluation of Memory Technologies for Synaptic Crossbar Arrays: Device-Circuit Non-Idealities and System Accuracy
In-memory computing (IMC) utilizing synaptic crossbar arrays is promising for
deep neural networks to attain high energy efficiency and integration density.
Towards that end, various CMOS and post-CMOS technologies have been explored as
promising synaptic device candidates which include SRAM, ReRAM, FeFET,
SOT-MRAM, etc. However, each of these technologies has its own pros and cons,
which need to be comparatively evaluated in the context of synaptic array
designs. For a fair comparison, such an analysis must carefully optimize each
technology, specifically for synaptic crossbar design accounting for device and
circuit non-idealities in crossbar arrays such as variations, wire resistance,
driver/sink resistance, etc. In this work, we perform a comprehensive design
space exploration and comparative evaluation of different technologies at the 7nm
technology node for synaptic crossbar arrays, in the context of IMC robustness
and system accuracy. Firstly, we integrate different technologies into a
cross-layer simulation flow based on physics-based models of synaptic devices
and interconnects. Secondly, we optimize both technology-agnostic design knobs
such as input encoding and ON-resistance as well as technology-specific design
parameters including ferroelectric thickness in FeFET and MgO thickness in
SOT-MRAM. Our optimization methodology accounts for the implications of device-
and circuit-level non-idealities on the system-level accuracy for each
technology. Finally, based on the optimized designs, we obtain inference
results for ResNet-20 on the CIFAR-10 dataset and show that FeFET-based crossbar
arrays achieve the highest accuracy due to their compactness, low leakage and
high ON/OFF current ratio.
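As background for the crossbar discussion above, the sketch below shows the idealized operation a synaptic crossbar performs and how one non-ideality perturbs it. It is a simplified illustration, not the paper's physics-based simulation flow: the function names are hypothetical, device variation is modeled as multiplicative Gaussian noise (an assumption), and wire, driver, and sink resistance are ignored.

```python
import random


def crossbar_mvm(voltages, conductances):
    """Ideal column currents of a crossbar: I_j = sum_i V_i * G_ij.

    voltages:     list of row input voltages (V)
    conductances: row-major matrix of device conductances (S)
    """
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


def perturb_conductances(conductances, rel_sigma, rng):
    """Apply multiplicative Gaussian variation to every device.

    rel_sigma is the relative standard deviation (assumed noise model).
    """
    return [
        [g * (1.0 + rng.gauss(0.0, rel_sigma)) for g in row]
        for row in conductances
    ]


rng = random.Random(0)
g = [[1e-6, 2e-6], [3e-6, 4e-6]]
v = [0.1, 0.2]
ideal = crossbar_mvm(v, g)
noisy = crossbar_mvm(v, perturb_conductances(g, 0.05, rng))
print(ideal, noisy)
```

For these inputs the ideal column currents are about 0.7 µA and 1.0 µA; the gap between `ideal` and `noisy` is the kind of error that propagates to system-level accuracy.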
A Phase Change Memory and DRAM Based Framework For Energy-Efficient and High-Speed In-Memory Stochastic Computing
Convolutional Neural Networks (CNNs) have proven to be highly effective in various fields related to Artificial Intelligence (AI) and Machine Learning (ML). However, the significant computational and memory requirements of CNNs make their processing highly compute- and memory-intensive. In particular, the multiply-accumulate (MAC) operation, which is a fundamental building block of CNNs, requires an enormous number of arithmetic operations. As the input dataset size increases, the traditional processor-centric von Neumann computing architecture becomes ill-suited for CNN-based applications. This results in exponentially higher latency and energy costs, making the processing of CNNs highly challenging.
To overcome these challenges, researchers have explored the Processing-In-Memory (PIM) technique, which involves placing the processing unit inside or near the memory unit. This approach reduces data migration length and utilizes the internal memory bandwidth at the memory chip level. However, developing a reliable PIM-based system with minimal hardware modifications and design complexity remains a significant challenge.
The proposed solution in the report suggests utilizing different memory technologies, such as Dynamic RAM (DRAM) and phase change memory (PCM), with Stochastic arithmetic and minimal add-on logic. Stochastic computing is a technique that uses random numbers to perform arithmetic operations instead of traditional binary representation. This technique reduces hardware requirements for CNN's arithmetic operations, making it possible to implement them with minimal add-on logic.
The report details the workflow for performing arithmetical operations used by CNNs, including MAC, activation, and floating-point functions. The proposed solution includes designs for scalable Stochastic Number Generator (SNG), DRAM CNN accelerator, non-volatile memory (NVM) class PCRAM-based CNN accelerator, and DRAM-based stochastic to binary conversion (StoB) for in-situ deep learning. These designs utilize stochastic computing to reduce the hardware requirements for CNN's arithmetic operations and enable energy and time-efficient processing of CNNs.
The report also identifies future research directions for the proposed designs, including in-situ PCRAM-based SNG, ODIN (A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-Situ Neural Network Processing in Phase Change RAM), ATRIA (Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing), and AGNI (In-Situ, Iso-Latency Stochastic-to-Binary Number Conversion for In-DRAM Deep Learning), and presents initial findings for these ideas.
In summary, the proposed solution in the report offers a comprehensive approach to address the challenges of processing CNNs, and the proposed designs have the potential to improve the energy and time efficiency of CNNs significantly. Using Stochastic Computing and different memory technologies enables the development of reliable PIM-based systems with minimal hardware modifications and design complexity, providing a promising path for the future of CNN-based applications.
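To make the role of the stochastic number generator (SNG), stochastic arithmetic, and stochastic-to-binary conversion (StoB) concrete, here is a minimal software sketch. It assumes the common unipolar encoding, in which a value p in [0, 1] is represented by a bitstream whose bits are 1 with probability p; the report's actual encoding and circuits may differ, and the function names are hypothetical.

```python
import random


def sng(p, length, rng):
    """Stochastic number generator: bitstream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(xs, ys):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [x & y for x, y in zip(xs, ys)]


def stob(bits):
    """Stochastic-to-binary conversion: fraction of ones in the stream."""
    return sum(bits) / len(bits)


rng = random.Random(0)
a = sng(0.5, 20000, rng)
b = sng(0.4, 20000, rng)
# The estimate approaches 0.5 * 0.4 = 0.2 as the streams get longer.
print(stob(sc_multiply(a, b)))
```

In this encoding a multiplier reduces to a single AND gate per bit, which is why stochastic arithmetic can be attractive when only minimal add-on logic is available; the price is that accuracy depends on stream length and on the streams being uncorrelated.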
Energy-Efficient In-Memory Architectures Leveraging Intrinsic Behaviors of Embedded MRAM Devices
For decades, innovations to surmount the processor versus memory gap and move beyond conventional von Neumann architectures have been sought and explored. Recent machine learning models still expend orders of magnitude more time and energy accessing data in memory than performing the computation itself. This phenomenon, referred to as the memory-wall bottleneck, is addressed herein via a completely fresh perspective on logic and memory technology design. The specific solutions developed in this dissertation focus on utilizing intrinsic switching behaviors of embedded MRAM devices to design cross-layer and energy-efficient Compute-in-Memory (CiM) architectures, accelerate the computationally intensive operations in various Artificial Neural Networks (ANNs), and achieve higher density and reduced power consumption, which are crucial requirements in future Internet of Things (IoT) devices. The first cross-layer platform developed herein is an Approximate Generative Adversarial Network (ApGAN) designed to accelerate Generative Adversarial Networks from both algorithm and hardware implementation perspectives. In addition to binarizing the weights, further reduction in storage and computation resources is achieved by leveraging an in-memory addition scheme. Moreover, a memristor-based CiM accelerator for ApGAN is developed. The second design is a biologically-inspired memory architecture. The Short-Term Memory and Long-Term Memory features in biology are realized in hardware via a beyond-CMOS-based learning approach derived from the repeated input information and retrieval of the encoded data. The third cross-layer architecture is a programmable energy-efficient hardware implementation for Recurrent Neural Networks with ultra-low power, area-efficient spin-based activation functions. A novel CiM architecture is proposed to leverage data-level parallelism during the evaluation phase.
Specifically, we employ an MRAM-based Adjustable Probabilistic Activation Function (APAF) via a low-power tunable activation mechanism, providing adjustable accuracy levels to mimic ideal sigmoid and tanh thresholding along with a matching algorithm to regulate neuronal properties. Finally, the APAF design is utilized in the Long Short-Term Memory (LSTM) network to evaluate the network performance using binary and non-binary activation functions. The simulation results indicate up to 74.5x energy efficiency, 35-fold speedup, and ~11x area reduction compared with similar baseline designs. These can form the basis for future post-CMOS-based non-von Neumann architectures suitable for intermittently powered energy-harvesting devices capable of pushing intelligence towards the edge of the computing network.
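The general idea of a probabilistic activation that mimics a sigmoid can be sketched in software as a neuron that fires with probability equal to the sigmoid of its input. This is an illustrative assumption only, not the dissertation's MRAM device model or its matching algorithm; the function names and the use of sample averaging are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Ideal logistic sigmoid used as the reference curve."""
    return 1.0 / (1.0 + math.exp(-x))


def probabilistic_activation(x, rng):
    """Binary output that is 1 with probability sigmoid(x)."""
    return 1 if rng.random() < sigmoid(x) else 0


def estimate_activation(x, n_samples, rng):
    """Average of n_samples binary outputs; approximates sigmoid(x)."""
    hits = sum(probabilistic_activation(x, rng) for _ in range(n_samples))
    return hits / n_samples


rng = random.Random(0)
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), estimate_activation(x, 20000, rng))
```

Averaging more binary samples brings the estimate closer to the ideal curve, which is one simple way to picture how a probabilistic element can trade effort for accuracy.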