7 research outputs found
Towards Efficient In-memory Computing Hardware for Quantized Neural Networks: State-of-the-art, Open Challenges and Perspectives
The amount of data processed in the cloud, the development of
Internet-of-Things (IoT) applications, and growing data privacy concerns force
the transition from cloud-based to edge-based processing. Limited energy and
computational resources on edge push the transition from traditional von
Neumann architectures to In-memory Computing (IMC), especially for machine
learning and neural network applications. Network compression techniques are
applied to implement a neural network on limited hardware resources.
Quantization is one of the most efficient network compression techniques,
reducing the memory footprint, latency, and energy consumption. This
paper provides a comprehensive review of IMC-based Quantized Neural Networks
(QNN) and links software-based quantization approaches to IMC hardware
implementation. Moreover, open challenges, QNN design requirements,
recommendations, and perspectives along with an IMC-based QNN hardware roadmap
are provided.
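As a rough illustration of why quantization shrinks the memory footprint, the sketch below applies symmetric per-tensor uniform quantization to a few weights. It is not taken from the paper: the function names, the sample weights, and the choice of a symmetric scheme with a 32-bit floating-point baseline are all assumptions made for this example.

```python
def quantize_symmetric(weights, bits):
    """Map float weights to signed integer codes in [-qmax, qmax].

    Hypothetical helper: symmetric, per-tensor, round-to-nearest.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.4, -1.0, 0.25, 0.0]   # illustrative values only
codes, scale = quantize_symmetric(weights, bits=8)
restored = dequantize(codes, scale)

# Worst-case rounding error is half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Assumed baseline: 32-bit floats, so 8-bit codes need 4x fewer bits.
footprint_ratio = 32 / 8

print(codes, max_error <= scale / 2 + 1e-12, footprint_ratio)
```

Lowering `bits` increases the footprint ratio but also widens the quantization step, which is the accuracy trade-off that the reviewed QNN work has to manage.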
A Construction Kit for Efficient Low Power Neural Network Accelerator Designs
Implementing embedded neural network processing at the edge requires
efficient hardware acceleration that couples high computational performance
with low power consumption. Driven by the rapid evolution of network
architectures and their algorithmic features, accelerator designs are
constantly updated and improved. To evaluate and compare hardware design
choices, designers can refer to a myriad of accelerator implementations in the
literature. Surveys provide an overview of these works but are often limited to
system-level and benchmark-specific performance metrics, making it difficult to
quantitatively compare the individual effect of each utilized optimization
technique. This complicates the evaluation of optimizations for new accelerator
designs, slowing down research progress. This work provides a survey of
neural network accelerator optimization approaches that have been used in
recent works and reports their individual effects on edge processing
performance. It presents the list of optimizations and their quantitative
effects as a construction kit, allowing designers to assess the design choices for each
building block separately. Reported optimizations range from up to 10,000x
memory savings to 33x energy reductions, providing chip designers an overview
of design choices for implementing efficient low power neural network
accelerators.
Algorithm and Hardware Co-Design for Local/Edge Computing
Advances in VLSI manufacturing and design technology over the decades have created many computing paradigms for disparate computing needs. With concerns about the transmission cost, security, and latency of centralized computing, edge/local computing is increasingly prevalent in fast-growing sectors such as the Internet-of-Things (IoT) and in other sectors that require energy- and connectivity-autonomous systems, such as biomedical and industrial applications.
Energy and power efficiency are the main design constraints in local and edge computing. While a wide range of low power design techniques exists, they are often underutilized in custom circuit designs because the algorithms are developed independently of the hardware. Such a compartmentalized design approach fails to take advantage of the many compatible algorithmic and hardware techniques that can improve the efficiency of the entire system. Algorithm-hardware co-design explores the design space with whole-stack awareness.
The main goal of the algorithm-hardware co-design methodology is to enable and improve small form factor edge and local VLSI systems operating under strict constraints of area and energy efficiency. This thesis presents selected works of application-specific digital and mixed-signal integrated circuit designs. The application space ranges from implantable biomedical devices to edge machine learning acceleration.
COMPUTE-IN-MEMORY WITH EMERGING NON-VOLATILE MEMORIES FOR ACCELERATING DEEP NEURAL NETWORKS
The objective of this research is to accelerate deep neural networks (DNNs) with a compute-in-memory (CIM) architecture based on emerging non-volatile memories (eNVMs). The research first focuses on inference acceleration and proposes a resistive random access memory (RRAM) based CIM architecture. Two generations of RRAM test chips, which monolithically integrate the RRAM memory array and CMOS peripheral circuits, are designed and fabricated using the Winbond 90 nm and TSMC 40 nm commercial embedded RRAM processes, respectively. The first-generation test chip, named XNOR-RRAM, is dedicated to binary neural networks (BNNs); the second generation, named Flex-RRAM, features 1-bit to 8-bit run-time configurable precision and leverages the input sparsity of the DNN model to improve throughput and energy efficiency. However, the non-ideal characteristics of eNVM devices, especially when utilized as multi-level analog synaptic weights, may incur notable accuracy degradation for both training and inference. This research develops a PyTorch-based framework that incorporates the device characteristics into the DNN model to evaluate the impact of eNVM nonidealities on training/inference accuracy. The results suggest that it is challenging to use eNVMs directly for in-situ training and that resistance drift remains a critical challenge for maintaining high inference accuracy. Furthermore, the asymmetric conductance tuning behavior of typical eNVMs is found to be the most critical nonideality preventing the model from achieving software-equivalent training accuracy. To overcome it, this research proposes a novel 2-transistor-1-FeFET (ferroelectric field effect transistor) based synaptic weight cell that exploits hybrid precision for in-situ training and inference and achieves near-software classification accuracy on the MNIST and CIFAR-10 datasets.
Ph.D.
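The name XNOR-RRAM points to the XNOR-and-popcount formulation commonly used for binary neural networks. The abstract does not describe the circuit itself, so the following is only a software sketch of that arithmetic identity, with hypothetical function names and an assumed bit encoding (bit 1 for +1, bit 0 for -1).

```python
def encode_bits(values):
    """Pack a list of +1/-1 values into an int (bit i is 1 when values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, b_word, n):
    """Dot product of two n-element +1/-1 vectors from their packed bits.

    XNOR marks positions where the signs agree; each agreement contributes
    +1 and each disagreement -1, so the dot product is 2 * matches - n.
    """
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask
    matches = bin(agree).count("1")
    return 2 * matches - n


a = [1, -1, 1, 1, -1, -1, 1, -1]   # illustrative activations
b = [1, 1, -1, 1, -1, 1, 1, -1]    # illustrative binary weights

reference = sum(x * y for x, y in zip(a, b))
result = xnor_popcount_dot(encode_bits(a), encode_bits(b), len(a))
print(result, reference)  # both are 2
```

In a CIM array the agreement count would be accumulated along a column in the analog domain and then digitized; here it is a plain bit count.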
Low power circuit design techniques for edge computing
In the booming era of Internet-of-Things (IoT), the trend of pushing inference from cloud to edge due to concerns of latency, bandwidth, and privacy has created a demand for energy-efficient edge computing devices. Edge computing devices have become critical building blocks in modern electronic systems, supporting various applications such as neural network inference, mobile healthcare monitoring, and human-machine interfaces. To improve the energy efficiency of edge devices, the author worked in three directions: 1) developing a ternary neural network accelerator achieving higher energy efficiency than state-of-the-art binary neural networks; 2) developing a 4-bit neural network accelerator with one-shot ADC conversion for the entire MAC array; 3) developing a long-term, real-time muscle fatigue detection device with ultrathin, ultrasoft, and long-term stable dry epidermal electrodes. In the first part, we propose a mixed-signal ternary CNN-based processor featuring higher energy efficiency than BNN. It confers several key improvements: 1) the proposed ternary network provides 1.5-b resolution (0/+1/-1), leading to a 3.9x reduction in OPs/inference relative to BNN for the same MNIST accuracy; 2) a 1.5-b multiply-and-accumulate (MAC) is implemented by a VCM-based capacitor switching scheme, which inherently benefits from the reduced signal swing on the capacitive DAC (CDAC); 3) the VCM-based MAC introduces sparsity during training, resulting in a lower switching rate. With a complete neural network on chip, the proposed design realizes 97.1% MNIST accuracy with only 0.18 μJ per classification, presenting the highest power efficiency for comparable MNIST accuracy. The second part of this dissertation focuses on a 4-bit MAC macro. This work proposes a mixed-signal MAC macro that requires only 1 ADC operation for all 512 4b×4b MAC operations.
This is achieved by mapping 9 partial products onto 5 wires based on their relative weights, dynamically buffering the 5 wire voltages, and sampling them on properly sized SAR ADC capacitors. As a result, all MAC operations are finished in the charge domain by the end of the ADC sampling, allowing only 1 A/D conversion per multi-bit MAC. To further increase power efficiency, window-based comparison skipping and ReLU are embedded inside the SAR ADC, so that unnecessary comparison cycles are skipped for small or negative MAC outputs. Overall, despite using a 65 nm process, the prototype chip achieves an energy efficiency of 164 TOPS/W for a 4-b MAC. Finally, this dissertation also presents a long-term, real-time muscle fatigue monitoring system consisting of 1) a hair-thin, skin-soft, and mechanically robust e-tattoo electrode that is less susceptible to motion artifacts and capable of multi-day monitoring, and 2) a battery-powered edge computing flexible printed circuit (FPC) that extracts the instantaneous median frequency (IMDF) of surface electromyography (sEMG) bursts and wirelessly streams it to a mobile application. The system draws an average current of 33 mA, supporting 25 hours of continuous operation, and could be extended to multiple days if only activated intermittently.
Electrical and Computer Engineering
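The "9 partial products onto 5 wires" grouping can be checked with plain integer arithmetic. One reading consistent with those counts, which is an assumption and not stated in the abstract, is that each operand contributes three magnitude bits: 3 x 3 one-bit partial products fall into five binary weights (2^0 through 2^4). The sketch below only models that bookkeeping, not the charge-domain circuit, and its function names are hypothetical.

```python
from collections import defaultdict


def group_partial_products(a, b, bits=3):
    """Group the 1-bit partial products of a * b by binary weight i + j.

    Assumes unsigned operands of `bits` magnitude bits each.
    """
    groups = defaultdict(list)
    for i in range(bits):
        for j in range(bits):
            a_i = (a >> i) & 1
            b_j = (b >> j) & 1
            groups[i + j].append(a_i & b_j)
    return dict(groups)


def weighted_sum(groups):
    """Combine per-weight counts the way binary-weighted sampling would."""
    return sum(sum(products) << weight for weight, products in groups.items())


groups = group_partial_products(5, 6)
num_wires = len(groups)                                # distinct weights
num_products = sum(len(p) for p in groups.values())    # partial products

print(num_wires, num_products, weighted_sum(groups))   # 5 9 30
```

Each weight group corresponds to one "wire" in this reading, and the binary weighting applied in `weighted_sum` plays the role that the properly sized SAR ADC capacitors play in the abstract's description.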
CIRCUITS AND ARCHITECTURE FOR BIO-INSPIRED AI ACCELERATORS
Technological advances in microelectronics envisioned through Moore’s law have led to powerful processors that can handle complex and computationally intensive tasks. Nonetheless, these advancements through technology scaling have come at an unfavorable cost of significantly larger power consumption, which has posed challenges for data processing centers and computers at scale. Moreover, with the emergence of mobile computing platforms constrained by power and bandwidth for distributed computing, the necessity for more energy-efficient scalable local processing has become more significant. Unconventional Compute-in-Memory architectures such as the analog winner-takes-all associative-memory and the Charge-Injection Device processor have been proposed as alternatives.
Unconventional charge-based computation has been employed for neural network accelerators in the past, where impressive energy efficiency per operation has been attained in 1-bit vector-vector multiplications, and in recent work, multi-bit vector-vector multiplications. In the latter, computation was carried out by counting quanta of charge at the thermal noise limit, using packets of about 1000 electrons. These systems are neither analog nor digital in the traditional sense but employ mixed-signal circuits to count the packets of charge and hence we call them Quasi-Digital. By amortizing the energy costs of the mixed-signal encoding/decoding over compute-vectors with many elements, high energy efficiencies can be achieved.
In this dissertation, I present a design framework for AI accelerators using scalable compute-in-memory architectures. On the device level, two primitive elements are designed and characterized as target computational technologies: (i) a multilevel non-volatile cell and (ii) a pseudo Dynamic Random-Access Memory (pseudo-DRAM) bit-cell. At the level of circuit description, compute-in-memory crossbars and mixed-signal circuits were designed, allowing seamless connectivity to digital controllers. At the level of data representation, both binary and stochastic-unary coding are used to compute Vector-Vector Multiplications (VMMs) at the array level. Finally, on the architectural level, two AI accelerators, one for data-center processing and one for edge computing, are discussed. Both designs are scalable multi-core Systems-on-Chip (SoCs), where vector-processor arrays are tiled on a 2-layer Network-on-Chip (NoC), enabling neighbor communication and flexible compute vs. memory trade-offs. General-purpose Arm/RISC-V co-processors provide adequate bootstrapping and system housekeeping, and a high-speed interface fabric facilitates Input/Output to main memory.
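The abstract does not define its stochastic-unary coding, so the sketch below shows only the common textbook form of unipolar stochastic computing, as an assumption: a value in [0, 1] is encoded as the fraction of ones in a random bitstream, and multiplying two independent streams reduces to a bitwise AND. Stream length, seed, and function names are illustrative choices.

```python
import random


def to_stream(p, length, rng):
    """Encode probability p as a random bitstream with roughly p * length ones."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def stream_value(stream):
    """Decode a bitstream back to a value in [0, 1]."""
    return sum(stream) / len(stream)


def stochastic_multiply(p, q, length=20000, seed=0):
    """Estimate p * q by ANDing two independently generated bitstreams."""
    rng = random.Random(seed)
    a = to_stream(p, length, rng)
    b = to_stream(q, length, rng)
    product = [x & y for x, y in zip(a, b)]
    return stream_value(product)


estimate = stochastic_multiply(0.5, 0.4)
print(estimate)  # close to 0.2, with error shrinking as length grows
```

The appeal for hardware is that a multiplier becomes a single gate per bit; the cost is that precision improves only with the square root of the stream length.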