
    FAT: An In-Memory Accelerator with Fast Addition for Ternary Weight Neural Networks

    Convolutional Neural Networks (CNNs) demonstrate excellent performance in many applications but are computationally expensive. Quantization is applied to reduce the latency and storage cost of CNNs. Among quantization methods, Binary and Ternary Weight Networks (BWNs and TWNs) have a unique advantage over 8-bit and 4-bit quantization: they replace the multiplication operations in CNNs with additions, which are favoured on In-Memory Computing (IMC) devices. IMC acceleration for BWNs has been widely studied; however, although TWNs offer higher accuracy and greater sparsity than BWNs, IMC acceleration for TWNs has received little attention. TWNs run inefficiently on existing IMC devices because their sparsity is not exploited and the addition operation itself is slow. In this paper, we propose FAT, a novel IMC accelerator for TWNs. First, we propose a Sparse Addition Control Unit, which exploits the sparsity of TWNs to skip null operations on zero weights. Second, we propose a fast addition scheme based on the memory Sense Amplifier that avoids the time overhead of both carry propagation and writing the carry back to the memory cells. Third, we propose a Combined-Stationary data mapping that reduces the data movement of activations and weights and increases parallelism across memory columns. Simulation results show that, for addition operations at the Sense Amplifier level, FAT achieves 2.00X speedup, 1.22X power efficiency, and 1.22X area efficiency compared with ParaPIM, a state-of-the-art IMC accelerator. On networks with 80% average sparsity, FAT achieves 10.02X speedup and 12.19X energy efficiency over ParaPIM.
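
    As a minimal illustration of the arithmetic this paper builds on (not of FAT's hardware), the sketch below computes a dot product with ternary weights using only additions, skipping zero weights just as the Sparse Addition Control Unit skips null operations. The Python code and its names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ternary_dot(activations, weights):
    """Dot product with ternary weights in {-1, 0, +1} using only additions.

    Zero weights are skipped entirely (the sparsity FAT exploits); the
    remaining terms reduce to additions and subtractions, so no
    multiplier is needed.
    """
    acc = 0
    for a, w in zip(activations, weights):
        if w == 0:                     # null operation: skip it
            continue
        acc += a if w > 0 else -a      # +1 -> add, -1 -> subtract
    return acc

x = np.array([3, 1, 4, 1, 5])
w = np.array([0, 1, 0, 0, -1])         # 60% zeros: 3 of 5 terms skipped
print(ternary_dot(x, w))               # 1 - 5 = -4
```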

    System and Design Technology Co-optimization of SOT-MRAM for High-Performance AI Accelerator Memory System

    SoCs are now designed with a dedicated AI accelerator segment to accommodate the ever-increasing demand of Deep Learning (DL) applications. With powerful MAC engines for matrix multiplications, these accelerators offer high computing performance; however, because of limited memory resources (i.e., bandwidth and capacity), they fail to achieve optimum system performance during large-batch training and inference. In this work, we propose a memory system with high on-chip capacity and bandwidth to shift AI accelerators from being memory-bound toward system-level peak performance. We build the memory system around customized SOT-MRAM enabled by design-technology co-optimization (DTCO) as a large on-chip memory, guided by system-technology co-optimization (STCO) and detailed characterization of the DL workloads. Our workload-aware memory system achieves 8X energy and 9X latency improvement on Computer Vision (CV) benchmarks and 8X energy and 4.5X latency improvement on Natural Language Processing (NLP) benchmarks in training, while consuming only around 50% of the SRAM area at iso-capacity.
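
    The memory-bound behaviour described above can be made concrete with a back-of-envelope roofline model. The sketch below is purely hypothetical; the peak-throughput, bandwidth, and arithmetic-intensity numbers are assumptions for illustration, not figures from the paper.

```python
def attainable_tflops(peak_tflops, bandwidth_gb_s, flops_per_byte):
    """Roofline model: achieved throughput is the lesser of the compute
    roof and the memory roof (bandwidth x arithmetic intensity)."""
    memory_roof = bandwidth_gb_s * flops_per_byte / 1000.0  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, memory_roof)

PEAK = 100.0       # hypothetical MAC-engine peak, TFLOP/s
INTENSITY = 50.0   # assumed FLOPs per byte for large-batch DL kernels
for bw in (100.0, 1000.0):   # assumed off-chip vs. large on-chip bandwidth, GB/s
    print(f"{bw:6.0f} GB/s -> {attainable_tflops(PEAK, bw, INTENSITY):5.1f} TFLOP/s")
# At 100 GB/s the accelerator is stuck at 5 TFLOP/s (memory-bound); a 10X
# on-chip bandwidth lifts it to 50 TFLOP/s, far closer to the compute roof.
```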

    Design Space Exploration and Comparative Evaluation of Memory Technologies for Synaptic Crossbar Arrays: Device-Circuit Non-Idealities and System Accuracy

    In-memory computing (IMC) utilizing synaptic crossbar arrays is promising for deep neural networks to attain high energy efficiency and integration density. Towards that end, various CMOS and post-CMOS technologies have been explored as synaptic device candidates, including SRAM, ReRAM, FeFET, and SOT-MRAM. However, each of these technologies has its own pros and cons, which need to be comparatively evaluated in the context of synaptic array designs. For a fair comparison, such an analysis must carefully optimize each technology specifically for synaptic crossbar design, accounting for device and circuit non-idealities in crossbar arrays such as variations, wire resistance, and driver/sink resistance. In this work, we perform a comprehensive design space exploration and comparative evaluation of different technologies at the 7nm technology node for synaptic crossbar arrays, in the context of IMC robustness and system accuracy. First, we integrate the different technologies into a cross-layer simulation flow based on physics-based models of synaptic devices and interconnects. Second, we optimize both technology-agnostic design knobs, such as input encoding and ON-resistance, and technology-specific design parameters, including the ferroelectric thickness in FeFET and the MgO thickness in SOT-MRAM. Our optimization methodology accounts for the implications of device- and circuit-level non-idealities on system-level accuracy for each technology. Finally, based on the optimized designs, we obtain inference results for ResNet-20 on the CIFAR-10 dataset and show that FeFET-based crossbar arrays achieve the highest accuracy due to their compactness, low leakage, and high ON/OFF current ratio.
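
    A toy model of the evaluation target may help: a synaptic crossbar performs a matrix-vector multiply as analog current summation, and device variation perturbs each conductance. The sketch below uses an assumed lognormal variation model and an assumed conductance range, and ignores wire and driver/sink resistance, which the paper's flow additionally models.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar_mvm(v_in, G, sigma=0.0):
    """Synaptic crossbar MVM: column currents I = V @ G (Kirchhoff
    summation). Device-to-device variation is modeled as lognormal
    noise on each conductance; sigma=0 recovers the ideal array."""
    G_actual = G * rng.lognormal(mean=0.0, sigma=sigma, size=G.shape)
    return v_in @ G_actual

V = np.array([0.2, 0.0, 0.2, 0.2])         # input voltages (encoded activations)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))   # conductances between 1/R_off and 1/R_on
print("ideal      :", crossbar_mvm(V, G))
print("with noise :", crossbar_mvm(V, G, sigma=0.3))
```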

    A Phase Change Memory and DRAM Based Framework For Energy-Efficient and High-Speed In-Memory Stochastic Computing

    Convolutional Neural Networks (CNNs) have proven highly effective in various fields related to Artificial Intelligence (AI) and Machine Learning (ML). However, their significant computational and memory requirements make CNN processing highly compute- and memory-intensive. In particular, the multiply-accumulate (MAC) operation, a fundamental building block of CNNs, requires an enormous number of arithmetic operations. As input dataset sizes grow, the traditional processor-centric von Neumann computing architecture becomes ill-suited for CNN-based applications, driving latency and energy costs sharply higher and making CNN processing challenging. To overcome these challenges, researchers have explored the Processing-in-Memory (PIM) technique, which places the processing unit inside or near the memory unit, shortening data movement and exploiting the internal bandwidth at the memory-chip level. However, developing a reliable PIM-based system with minimal hardware modifications and design complexity remains a significant challenge. This report proposes utilizing different memory technologies, such as Dynamic RAM (DRAM) and Phase Change Memory (PCM), together with stochastic arithmetic and minimal add-on logic. Stochastic computing is a technique that performs arithmetic on random bitstreams instead of conventional binary representations, reducing the hardware required for CNNs' arithmetic operations so that they can be implemented with minimal add-on logic. The report details the workflow for performing the arithmetic operations used by CNNs, including MAC, activation, and floating-point functions. The proposed solution includes designs for a scalable Stochastic Number Generator (SNG), a DRAM-based CNN accelerator, a non-volatile memory (NVM) class PCRAM-based CNN accelerator, and DRAM-based stochastic-to-binary (StoB) conversion for in-situ deep learning. These designs use stochastic computing to reduce the hardware requirements of CNNs' arithmetic operations and enable energy- and time-efficient CNN processing. The report also identifies future research directions for the proposed designs, including an in-situ PCRAM-based SNG, ODIN (a bit-parallel stochastic-arithmetic-based accelerator for in-situ neural network processing in Phase Change RAM), ATRIA (a bit-parallel stochastic-arithmetic-based accelerator for in-DRAM CNN processing), and AGNI (in-situ, iso-latency stochastic-to-binary number conversion for in-DRAM deep learning), and presents initial findings for these ideas. In summary, the report offers a comprehensive approach to the challenges of CNN processing, and the proposed designs have the potential to significantly improve the energy and time efficiency of CNNs. Using stochastic computing and different memory technologies enables reliable PIM-based systems with minimal hardware modifications and design complexity, providing a promising path for the future of CNN-based applications.
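
    To make the stochastic-computing idea concrete: in unipolar coding, a value p in [0, 1] becomes a random bitstream whose fraction of 1s is p, multiplication collapses to a bitwise AND of two independent streams, and the stochastic-to-binary (StoB) step is a popcount. The sketch below is a generic software illustration of these primitives, not the report's in-DRAM/PCM circuits.

```python
import numpy as np

rng = np.random.default_rng(42)

def sng(p, n_bits=4096):
    """Stochastic Number Generator: encode p in [0, 1] as a bitstream
    whose fraction of 1s approximates p."""
    return rng.random(n_bits) < p

def sc_multiply(p_a, p_b, n_bits=4096):
    """Unipolar stochastic multiply: the AND of two independent streams
    has P(1) = p_a * p_b; taking the mean is the StoB conversion."""
    return (sng(p_a, n_bits) & sng(p_b, n_bits)).mean()

print(sc_multiply(0.5, 0.8))   # ~0.40; accuracy improves with longer streams
```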

    Energy-Efficient In-Memory Architectures Leveraging Intrinsic Behaviors of Embedded MRAM Devices

    For decades, innovations to surmount the processor-versus-memory gap and move beyond conventional von Neumann architectures have been sought and explored. Recent machine learning models still expend orders of magnitude more time and energy accessing data in memory than performing the computation itself. This phenomenon, referred to as the memory-wall bottleneck, is addressed herein via a fresh perspective on logic and memory technology design. The solutions developed in this dissertation utilize the intrinsic switching behaviors of embedded MRAM devices to design cross-layer, energy-efficient Compute-in-Memory (CiM) architectures, accelerate the computationally intensive operations in various Artificial Neural Networks (ANNs), achieve higher density, and reduce power consumption, crucial requirements for future Internet of Things (IoT) devices. The first cross-layer platform developed herein is an Approximate Generative Adversarial Network (ApGAN) designed to accelerate Generative Adversarial Networks from both the algorithm and hardware implementation perspectives. In addition to binarizing the weights, further reduction in storage and computation resources is achieved by leveraging an in-memory addition scheme. Moreover, a memristor-based CiM accelerator for ApGAN is developed. The second design is a biologically-inspired memory architecture: the Short-Term Memory and Long-Term Memory features found in biology are realized in hardware via a beyond-CMOS learning approach based on repeated presentation of input information and retrieval of the encoded data. The third cross-layer architecture is a programmable, energy-efficient hardware implementation for Recurrent Neural Networks with ultra-low-power, area-efficient spin-based activation functions. A novel CiM architecture is proposed to leverage data-level parallelism during the evaluation phase. Specifically, we employ an MRAM-based Adjustable Probabilistic Activation Function (APAF) via a low-power tunable activation mechanism, providing adjustable accuracy levels to mimic ideal sigmoid and tanh thresholding, along with a matching algorithm to regulate neuronal properties. Finally, the APAF design is utilized in a Long Short-Term Memory (LSTM) network to evaluate network performance with binary and non-binary activation functions. Simulation results indicate up to 74.5X energy-efficiency improvement, 35X speedup, and ~11X area reduction compared with similar baseline designs. These results can form the basis for future post-CMOS, non-von Neumann architectures suited to intermittently powered energy-harvesting devices that push intelligence towards the edge of the computing network.
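
    As a software analogue of the Adjustable Probabilistic Activation Function idea: a device that switches with probability sigmoid(x) yields a binary activation whose average over repeated trials approximates the ideal sigmoid, with the trial count acting as the accuracy knob. The sketch below is an illustrative assumption, not the MRAM circuit itself.

```python
import numpy as np

rng = np.random.default_rng(7)

def probabilistic_activation(x, n_samples=256):
    """Binary stochastic activation: each trial 'fires' with
    P(1) = sigmoid(x); averaging trials approximates the ideal sigmoid,
    and n_samples acts as the adjustable accuracy level."""
    p = 1.0 / (1.0 + np.exp(-x))
    return (rng.random(n_samples) < p).mean()

for x in (-2.0, 0.0, 2.0):
    print(f"x = {x:+.1f} -> {probabilistic_activation(x):.3f}")
# Expected ~0.119, ~0.500, ~0.881 (sigmoid values); noisier at low n_samples,
# which is the accuracy-versus-cost trade-off described above.
```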