47 research outputs found

    Improving Performance and Endurance for Crossbar Resistive Memory

    Get PDF
    Resistive Memory (ReRAM) has emerged as a promising non-volatile memory technology that may replace a significant portion of DRAM in future computer systems. When adopting crossbar architecture, ReRAM cell can achieve the smallest theoretical size in fabrication, ideally for constructing dense memory with large capacity. However, crossbar cell structure suffers from severe performance and endurance degradations, which come from large voltage drops on long wires. In this dissertation, I first study the correlation between the ReRAM cell switching latency and the number of cells in low resistant state (LRS) along bitlines, and propose to dynamically speed up write operations based on bitline data patterns. By leveraging the intrinsic in-memory processing capability of ReRAM crossbars, a low overhead runtime profiler that effectively tracks the data patterns in different bitlines is proposed. To achieve further write latency reduction, data compression and row address dependent memory data layout are employed to reduce the numbers of LRS cells on bitlines. Moreover, two optimization techniques are presented to mitigate energy overhead brought by bitline data patterns tracking. Second, I propose XWL, a novel table-based wear leveling scheme for ReRAM crossbars and study the correlation between write endurance and voltage stress in ReRAM crossbars. By estimating and tracking the effective write stress to different rows at runtime, XWL chooses the ones that are stressed the most to mitigate. Additionally, two extended scenarios are further examined for the performance and endurance issues in neural network accelerators as well as 3D vertical ReRAM (3D-VRAM) arrays. For the ReRAM crossbar-based accelerators, by exploiting the wearing out mechanism of ReRAM cell, a novel comprehensive framework, ReNEW, is proposed to enhance the lifetime of the ReRAM crossbar-based accelerators, particularly for neural network training. To reduce the write latency in 3D-VRAM arrays, a collection of techniques, including an in-memory data encoding scheme, a data pattern estimator for assessing cell resistance distributions, and a write time reduction scheme that opportunistically reduces RESET latency with runtime data patterns, are devised

    A Phase Change Memory and DRAM Based Framework For Energy-Efficient and High-Speed In-Memory Stochastic Computing

    Get PDF
    Convolutional Neural Networks (CNNs) have proven to be highly effective in various fields related to Artificial Intelligence (AI) and Machine Learning (ML). However, the significant computational and memory requirements of CNNs make their processing highly compute and memory-intensive. In particular, the multiply-accumulate (MAC) operation, which is a fundamental building block of CNNs, requires enormous arithmetic operations. As the input dataset size increases, the traditional processor-centric von-Neumann computing architecture becomes ill-suited for CNN-based applications. This results in exponentially higher latency and energy costs, making the processing of CNNs highly challenging. To overcome these challenges, researchers have explored the Processing-In Memory (PIM) technique, which involves placing the processing unit inside or near the memory unit. This approach reduces data migration length and utilizes the internal memory bandwidth at the memory chip level. However, developing a reliable PIM-based system with minimal hardware modifications and design complexity remains a significant challenge. The proposed solution in the report suggests utilizing different memory technologies, such as Dynamic RAM (DRAM) and phase change memory (PCM), with Stochastic arithmetic and minimal add-on logic. Stochastic computing is a technique that uses random numbers to perform arithmetic operations instead of traditional binary representation. This technique reduces hardware requirements for CNN\u27s arithmetic operations, making it possible to implement them with minimal add-on logic. The report details the workflow for performing arithmetical operations used by CNNs, including MAC, activation, and floating-point functions. The proposed solution includes designs for scalable Stochastic Number Generator (SNG), DRAM CNN accelerator, non-volatile memory (NVM) class PCRAM-based CNN accelerator, and DRAM-based stochastic to binary conversion (StoB) for in-situ deep learning. These designs utilize stochastic computing to reduce the hardware requirements for CNN\u27s arithmetic operations and enable energy and time-efficient processing of CNNs. The report also identifies future research directions for the proposed designs, including in-situ PCRAM-based SNG, ODIN (A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-Situ Neural Network Processing in Phase Change RAM), ATRIA (Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing), and AGNI (In-Situ, Iso-Latency Stochastic-to-Binary Number Conversion for In-DRAM Deep Learning), and presents initial findings for these ideas. In summary, the proposed solution in the report offers a comprehensive approach to address the challenges of processing CNNs, and the proposed designs have the potential to improve the energy and time efficiency of CNNs significantly. Using Stochastic Computing and different memory technologies enables the development of reliable PIM-based systems with minimal hardware modifications and design complexity, providing a promising path for the future of CNN-based applications

    ์—๋„ˆ์ง€ ํšจ์œจ์  ์ธ๊ณต์‹ ๊ฒฝ๋ง ์„ค๊ณ„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2019. 2. ์ตœ๊ธฐ์˜.์ตœ๊ทผ ์‹ฌ์ธต ํ•™์Šต์€ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜, ์Œ์„ฑ ์ธ์‹ ๋ฐ ๊ฐ•ํ™” ํ•™์Šต๊ณผ ๊ฐ™์€ ์˜์—ญ์—์„œ ๋†€๋ผ์šด ์„ฑ๊ณผ๋ฅผ ๊ฑฐ๋‘๊ณ  ์žˆ๋‹ค. ์ตœ์ฒจ๋‹จ ์‹ฌ์ธต ์ธ๊ณต์‹ ๊ฒฝ๋ง ์ค‘ ์ผ๋ถ€๋Š” ์ด๋ฏธ ์ธ๊ฐ„์˜ ๋Šฅ๋ ฅ์„ ๋„˜์–ด์„  ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ธ๊ณต์‹ ๊ฒฝ๋ง์€ ์—„์ฒญ๋‚œ ์ˆ˜์˜ ๊ณ ์ •๋ฐ€ ๊ณ„์‚ฐ๊ณผ ์ˆ˜๋ฐฑ๋งŒ๊ฐœ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ์ด์šฉํ•˜๊ธฐ ์œ„ํ•œ ๋นˆ๋ฒˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์•ก์„ธ์Šค๋ฅผ ์ˆ˜๋ฐ˜ํ•œ๋‹ค. ์ด๋Š” ์—„์ฒญ๋‚œ ์นฉ ๊ณต๊ฐ„๊ณผ ์—๋„ˆ์ง€ ์†Œ๋ชจ ๋ฌธ์ œ๋ฅผ ์•ผ๊ธฐํ•˜์—ฌ ์ž„๋ฒ ๋””๋“œ ์‹œ์Šคํ…œ์—์„œ ์ธ๊ณต์‹ ๊ฒฝ๋ง์ด ์‚ฌ์šฉ๋˜๋Š” ๊ฒƒ์„ ์ œํ•œํ•˜๊ฒŒ ๋œ๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ธ๊ณต์‹ ๊ฒฝ๋ง์„ ๋†’์€ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ๊ฐ–๋„๋ก ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ๋ฒˆ์งธ ํŒŒํŠธ์—์„œ๋Š” ๊ฐ€์ค‘ ์ŠคํŒŒ์ดํฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์งง์€ ์ถ”๋ก  ์‹œ๊ฐ„๊ณผ ์ ์€ ์—๋„ˆ์ง€ ์†Œ๋ชจ์˜ ์žฅ์ ์„ ๊ฐ–๋Š” ์ŠคํŒŒ์ดํ‚น ์ธ๊ณต์‹ ๊ฒฝ๋ง ์„ค๊ณ„ ๋ฐฉ๋ฒ•์„ ๋‹ค๋ฃฌ๋‹ค. ์ŠคํŒŒ์ดํ‚น ์ธ๊ณต์‹ ๊ฒฝ๋ง์€ ์ธ๊ณต์‹ ๊ฒฝ๋ง์˜ ๋†’์€ ์—๋„ˆ์ง€ ์†Œ๋น„ ๋ฌธ์ œ๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•œ ์œ ๋งํ•œ ๋Œ€์•ˆ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๊ธฐ์กด ์—ฐ๊ตฌ์—์„œ ์‹ฌ์ธต ์ธ๊ณต์‹ ๊ฒฝ๋ง์„ ์ •ํ™•๋„ ์†์‹ค์—†์ด ์ŠคํŒŒ์ดํ‚น ์ธ๊ณต์‹ ๊ฒฝ๋ง์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๋ฐœํ‘œ๋˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•๋“ค์€ rate coding์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ธด ์ถ”๋ก  ์‹œ๊ฐ„์„ ๊ฐ–๊ฒŒ ๋˜๊ณ  ์ด๊ฒƒ์ด ๋งŽ์€ ์—๋„ˆ์ง€ ์†Œ๋ชจ๋ฅผ ์•ผ๊ธฐํ•˜๊ฒŒ ๋˜๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค. ์ด ํŒŒํŠธ์—์„œ๋Š” ํŽ˜์ด์ฆˆ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ์ŠคํŒŒ์ดํฌ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ์ถ”๋ก  ์‹œ๊ฐ„์„ ํฌ๊ฒŒ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. MNIST, SVHN, CIFAR-10, CIFAR-100 ๋ฐ์ดํ„ฐ์…‹์—์„œ์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•œ ์ŠคํŒŒ์ดํ‚น ์ธ๊ณต์‹ ๊ฒฝ๋ง์ด ๊ธฐ์กด ๋ฐฉ๋ฒ•์— ๋น„ํ•ด ํฐ ํญ์œผ๋กœ ์ถ”๋ก  ์‹œ๊ฐ„๊ณผ ์ŠคํŒŒ์ดํฌ ๋ฐœ์ƒ ๋นˆ๋„๋ฅผ ์ค„์—ฌ์„œ ๋ณด๋‹ค ์—๋„ˆ์ง€ ํšจ์œจ์ ์œผ๋กœ ๋™์ž‘ํ•จ์„ ๋ณด์—ฌ์ค€๋‹ค. ๋‘๋ฒˆ์งธ ํŒŒํŠธ์—์„œ๋Š” ๊ณต์ • ๋ณ€์ด๊ฐ€ ์žˆ๋Š” ์ƒํ™ฉ์—์„œ ๋™์ž‘ํ•˜๋Š” ๊ณ ์—๋„ˆ์ง€ํšจ์œจ ์•„๋‚ ๋กœ๊ทธ ์ธ๊ณต์‹ ๊ฒฝ๋ง ์„ค๊ณ„ ๋ฐฉ๋ฒ•์„ ๋‹ค๋ฃจ๊ณ  ์žˆ๋‹ค. ์ธ๊ณต์‹ ๊ฒฝ๋ง์„ ์•„๋‚ ๋กœ๊ทธ ํšŒ๋กœ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ตฌํ˜„ํ•˜๋ฉด ๋†’์€ ๋ณ‘๋ ฌ์„ฑ๊ณผ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ์žฅ์ ์ด ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, ์•„๋‚ ๋กœ๊ทธ ์‹œ์Šคํ…œ์€ ๋…ธ์ด์ฆˆ์— ์ทจ์•ฝํ•œ ์ค‘๋Œ€ํ•œ ๊ฒฐ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋…ธ์ด์ฆˆ ์ค‘ ํ•˜๋‚˜๋กœ ๊ณต์ • ๋ณ€์ด๋ฅผ ๋“ค ์ˆ˜ ์žˆ๋Š”๋ฐ, ์ด๋Š” ์•„๋‚ ๋กœ๊ทธ ํšŒ๋กœ์˜ ์ ์ • ๋™์ž‘ ์ง€์ ์„ ๋ณ€ํ™”์‹œ์ผœ ์‹ฌ๊ฐํ•œ ์„ฑ๋Šฅ ์ €ํ•˜ ๋˜๋Š” ์˜ค๋™์ž‘์„ ์œ ๋ฐœํ•˜๋Š” ์›์ธ์ด๋‹ค. ์ด ํŒŒํŠธ์—์„œ๋Š” ReRAM์— ๊ธฐ๋ฐ˜ํ•œ ๊ณ ์—๋„ˆ์ง€ ํšจ์œจ ์•„๋‚ ๋กœ๊ทธ ์ด์ง„ ์ธ๊ณต์‹ ๊ฒฝ๋ง์„ ๊ตฌํ˜„ํ•˜๊ณ , ๊ณต์ • ๋ณ€์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ํ™œ์„ฑ๋„ ์ผ์น˜ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ ๊ณต์ • ๋ณ€์ด ๋ณด์ƒ ๊ธฐ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์ œ์•ˆ๋œ ์ธ๊ณต์‹ ๊ฒฝ๋ง์€ 1T1R ๊ตฌ์กฐ์˜ ReRAM ๋ฐฐ์—ด๊ณผ ์ฐจ๋™์ฆํญ๊ธฐ๋ฅผ ์ด์šฉํ•œ ๋‰ด๋Ÿฐ์„ ์ด์šฉํ•˜์—ฌ ๊ณ ๋ฐ€๋„ ์ง‘์ ๊ณผ ๊ณ ์—๋„ˆ์ง€ ํšจ์œจ ๋™์ž‘์ด ๊ฐ€๋Šฅํ•˜๊ฒŒ ๊ตฌ์„ฑ๋˜์—ˆ๋‹ค. ๋˜ํ•œ, ์•„๋‚ ๋กœ๊ทธ ๋‰ด๋Ÿฐ ํšŒ๋กœ์˜ ๊ณต์ • ๋ณ€์ด ์ทจ์•ฝ์„ฑ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ด์ƒ์ ์ธ ๋‰ด๋Ÿฐ์˜ ํ™œ์„ฑ๋„์™€ ๋™์ผํ•œ ํ™œ์„ฑ๋„๋ฅผ ๊ฐ–๋„๋ก ๋‰ด๋Ÿฐ์˜ ๋ฐ”์ด์–ด์Šค๋ฅผ ์กฐ์ ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์†Œ๊ฐœํ•œ๋‹ค. ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ 32nm ๊ณต์ •์—์„œ ๊ตฌํ˜„๋œ ์ธ๊ณต์‹ ๊ฒฝ๋ง์€ 3-sigma ์ง€์ ์—์„œ 50% ๋ฌธํ„ฑ ์ „์•• ๋ณ€์ด์™€ 15%์˜ ์ €ํ•ญ๊ฐ’ ๋ณ€์ด๊ฐ€ ์žˆ๋Š” ์ƒํ™ฉ์—์„œ๋„ MNIST์—์„œ 98.55%, CIFAR-10์—์„œ 89.63%์˜ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€์œผ๋ฉฐ, 970 TOPS/W์— ๋‹ฌํ•˜๋Š” ๋งค์šฐ ๋†’์€ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค.Recently, deep learning has shown astounding performances on specific tasks such as image classification, speech recognition, and reinforcement learning. Some of the state-of-the-art deep neural networks have already gone over humans ability. However, neural networks involve tremendous number of high precision computations and frequent off-chip memory accesses with millions of parameters. It incurs problems of large area and exploding energy consumption, which hinder neural networks from being exploited in embedded systems. To cope with the problem, techniques for designing energy efficient neural networks are proposed. The first part of this dissertation addresses the design of spiking neural networks with weighted spikes which has advantages of shorter inference latency and smaller energy consumption compared to the conventional spiking neural networks. Spiking neural networks are being regarded as one of the promising alternative techniques to overcome the high energy costs of artificial neural networks. It is supported by many researches showing that a deep convolutional neural network can be converted into a spiking neural network with near zero accuracy loss. However, the advantage on energy consumption of spiking neural networks comes at a cost of long classification latency due to the use of Poisson-distributed spike trains (rate coding), especially in deep networks. We propose to use weighted spikes, which can greatly reduce the latency by assigning a different weight to a spike depending on which time phase it belongs. Experimental results on MNIST, SVHN, CIFAR-10, and CIFAR-100 show that the proposed spiking neural networks with weighted spikes achieve significant reduction in classification latency and number of spikes, which leads to faster and more energy-efficient spiking neural networks than the conventional spiking neural networks with rate coding. We also show that one of the state-of-the-art networks the deep residual network can be converted into spiking neural network without accuracy loss. The second part of this dissertation focuses on the design of highly energy-efficient analog neural networks in the presence of variations. Analog hardware accelerators for deep neural networks have taken center stage in the aspect of high parallelism and energy efficiency. However, a critical weakness of the analog hardware systems is vulnerability to noise. One of the biggest noise sources is a process variation. It is a big obstacle to using analog circuits since the variation shifts various parameters of analog circuits from the correct operating points, which causes severe performance degradation or even malfunction. To achieve high energy efficiency with analog neural networks, we propose resistive random access memory (ReRAM) based analog implementation of binarized neural networks (BNNs) with a novel variation compensation technique through activation matching (VCAM). The proposed architecture consists of 1-transistor-1-resistor (1T1R) structured ReRAM synaptic arrays and differential amplifier based neurons, which leads to high-density integration and energy efficiency. To cope with the vulnerability of analog neurons due to process variation, the biases of all neurons are adjusted in the direction that matches average output activation of ideal neurons without variation. The technique effectively restores the classification accuracy degraded by the variation. Experimental results on 32nm technology show that the proposed architecture achieves the classification accuracy of 98.55% on MNIST and 89.63% on CIFAR-10 in the presence of 50% threshold voltage variation and 15% resistance variation at 3-sigma point. It also achieves 970 TOPS/W energy efficiency with MLP on MNIST.1 Introduction 1 1.1 Deep Neural Networks with Weighted Spikes . . . . . . . . . . . . . 2 1.2 VCAM: Variation Compensation through Activation Matching for Analog Binarized Neural Networks . . . . . . . . . . . . . . . . . . . . . 5 2 Background 8 2.1 Spiking neural network . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Spiking neuron model . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Rate coding in SNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Binarized neural networks . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Resistive random access memory . . . . . . . . . . . . . . . . . . . . 18 3 RelatedWork 22 3.1 Training SNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 SNNs with various spike coding schemes . . . . . . . . . . . . . . . 25 3.3 BNN implementations . . . . . . . . . . . . . . . . . . . . . . . . . 28 4 Deep Neural Networks withWeighted Spikes 33 4.1 SNN with weighted spikes . . . . . . . . . . . . . . . . . . . . . . . 34 4.1.1 Weighted spikes . . . . . . . . . . . . . . . . . . . . . . . . 34 4.1.2 Spiking neuron model for weighted spikes . . . . . . . . . . . 35 4.1.3 Noise spike . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1.4 Approximation of the ReLU activation . . . . . . . . . . . . 39 4.1.5 ANN-to-SNN conversion . . . . . . . . . . . . . . . . . . . . 41 4.2 Optimization techniques . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2.1 Skipping initial input currents in the output layer . . . . . . . 45 4.2.2 The number of phases in a period . . . . . . . . . . . . . . . 47 4.2.3 Accuracy-energy trade-off by early decision . . . . . . . . . . 50 4.2.4 Consideration on hardware implementation . . . . . . . . . . 52 4.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4.1 Comparison between SNN-RC and SNN-WS . . . . . . . . . 56 4.4.2 Trade-off by early decision . . . . . . . . . . . . . . . . . . . 64 4.4.3 Comparison with other algorithms . . . . . . . . . . . . . . . 67 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5 VCAM: Variation Compensation through Activation Matching for Analog Binarized Neural Networks 71 5.1 Modification of Binarized Neural Network . . . . . . . . . . . . . . . 72 5.1.1 Binarized Neural Network . . . . . . . . . . . . . . . . . . . 72 5.1.2 Use of 0 and 1 Activations . . . . . . . . . . . . . . . . . . . 72 5.1.3 Removal of Batch Normalization Layer . . . . . . . . . . . . 73 5.2 Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2.1 ReRAM Synaptic Array . . . . . . . . . . . . . . . . . . . . 75 5.2.2 Neuron Circuit . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.2.3 Issues with Neuron Circuit . . . . . . . . . . . . . . . . . . . 82 5.3 Variation Compensation . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.3.1 Variation Modeling . . . . . . . . . . . . . . . . . . . . . . . 85 5.3.2 Impact of VT Variation . . . . . . . . . . . . . . . . . . . . . 87 5.3.3 Variation Compensation Techniques . . . . . . . . . . . . . . 88 5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 93 5.4.2 Accuracy of the Modified BNN Algorithm . . . . . . . . . . 94 5.4.3 Variation Compensation . . . . . . . . . . . . . . . . . . . . 95 5.4.4 Performance Comparison . . . . . . . . . . . . . . . . . . . . 99 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6 Conclusion 102Docto

    Embracing Low-Power Systems with Improvement in Security and Energy-Efficiency

    Get PDF
    As the economies around the world are aligning more towards usage of computing systems, the global energy demand for computing is increasing rapidly. Additionally, the boom in AI based applications and services has already invited the pervasion of specialized computing hardware architectures for AI (accelerators). A big chunk of research in the industry and academia is being focused on providing energy efficiency to all kinds of power hungry computing architectures. This dissertation adds to these efforts. Aggressive voltage underscaling of chips is one the effective low power paradigms of providing energy efficiency. This dissertation identifies and deals with the reliability and performance problems associated with this paradigm and innovates novel energy efficient approaches. Specifically, the properties of a low power security primitive have been improved and, higher performance has been unlocked in an AI accelerator (Google TPU) in an aggressively voltage underscaled environment. And, novel power saving opportunities have been unlocked by characterizing the usage pattern of a baseline TPU with rigorous mathematical analysis

    Fast Fitting of the Dynamic Memdiode Model to the Conduction Characteristics of RRAM Devices Using Convolutional Neural Networks

    Get PDF
    In this paper, the use of Artificial Neural Networks (ANNs) in the form of Convolutional Neural Networks (AlexNET) for the fast and energy-efficient fitting of the Dynamic Memdiode Model (DMM) to the conduction characteristics of bipolar-type resistive switching (RS) devices is investigated. Despite an initial computationally intensive training phase the ANNs allow obtaining a mapping between the experimental Current-Voltage (I-V) curve and the corresponding DMM parameters without incurring a costly iterative process as typically considered in error minimization-based optimization algorithms. In order to demonstrate the fitting capabilities of the proposed approach, a complete set of I-Vs obtained from Yโ‚‚Oโ‚ƒ-based RRAM devices, fabricated with different oxidation conditions and measured with different current compliances, is considered. In this way, in addition to the intrinsic RS variability, extrinsic variation is achieved by means of external factors (oxygen content and damage control during the set process). We show that the reported method provides a significant reduction of the fitting time (one order of magnitude), especially in the case of large data sets. This issue is crucial when the extraction of the model parameters and their statistical characterization are required

    Ultra-low Power Circuits and Architectures for Neuromorphic Computing Accelerators with Emerging TFETs and ReRAMs

    Get PDF
    Neuromorphic computing using post-CMOS technologies is gaining increasing popularity due to its promising potential to resolve the power constraints in Von-Neumann machine and its similarity to the operation of the real human brain. To design the ultra-low voltage and ultra-low power analog-to-digital converters (ADCs) for the neuromorphic computing systems, we explore advantages of tunnel field effect transistor (TFET) analog-to-digital converters (ADCs) on energy efficiency and temperature stability. A fully-differential SAR ADC is designed using 20 nm TFET technology with doubled input swing and controlled comparator input common-mode voltage. To further increase the resolution of the ADC, we design an energy efficient 12-bit noise shaping (NS) successive-approximation register (SAR) ADC. The 2nd-order noise shaping architecture with multiple feed-forward paths is adopted and analyzed to optimize system design parameters. By utilizing tunnel field effect transistors (TFETs), the Delta-Sigma SAR is realized under an ultra-low supply voltage VDD with high energy efficiency. The stochastic neuron is a key for event-based probabilistic neural networks. We propose a stochastic neuron using a metal-oxide resistive random-access memory (ReRAM). The ReRAM\u27s conducting filament with built-in stochasticity is used to mimic the neuron\u27s membrane capacitor, which temporally integrates input spikes. A capacitor-less neuron circuit is designed, laid out, and simulated. The output spiking train of the neuron obeys the Poisson distribution. Based on the ReRAM based neuron, we propose a scalable and reconfigurable architecture that exploits the ReRAM-based neurons for deep Spiking Neural Networks (SNNs). In prior publications, neurons were implemented using dedicated analog or digital circuits that are not area and energy efficient. In our work, for the first time, we address the scaling and power bottlenecks of neuromorphic architecture by utilizing a single one-transistor-one-ReRAM (1T1R) cell to emulate the neuron. We show that the ReRAM-based neurons can be integrated within the synaptic crossbar to build extremely dense Process Element (PE)โ€“spiking neural network in memory arrayโ€“with high throughput. We provide microarchitecture and circuit designs to enable the deep spiking neural network computing in memory with an insignificant area overhead

    Reliability and Security of Compute-In-Memory Based Deep Neural Network Accelerators

    Get PDF
    Compute-In-Memory (CIM) is a promising solution for accelerating DNNs at edge devices, utilizing mixed-signal computations. However, it requires more cross-layer designs from algorithm levels to hardware implementations as it behaves differently from the pure digital system. On one side, the mixed-signal computations of CIM face unignorable variations, which could hamper the software performance. On the other side, there are potential software/hardware security vulnerabilities with CIM accelerators. This research aims to solve the reliability and security issues in CIM design for accelerating Deep Neural Network (DNN) algorithms as they prevent the real-life use of the CIM-based accelerators. Some non-ideal effects in CIM accelerators are explored, which could cause reliability issues, and solved by the software-hardware co-design methods. In addition, different security vulnerabilities for SRAM-based CIM and eNVM-based CIM inference engines are defined, and corresponding countermeasures are proposed.Ph.D

    Designing energy-efficient computing systems using equalization and machine learning

    Full text link
    As technology scaling slows down in the nanometer CMOS regime and mobile computing becomes more ubiquitous, designing energy-efficient hardware for mobile systems is becoming increasingly critical and challenging. Although various approaches like near-threshold computing (NTC), aggressive voltage scaling with shadow latches, etc. have been proposed to get the most out of limited battery life, there is still no โ€œsilver bulletโ€ to increasing power-performance demands of the mobile systems. Moreover, given that a mobile system could operate in a variety of environmental conditions, like different temperatures, have varying performance requirements, etc., there is a growing need for designing tunable/reconfigurable systems in order to achieve energy-efficient operation. In this work we propose to address the energy- efficiency problem of mobile systems using two different approaches: circuit tunability and distributed adaptive algorithms. Inspired by the communication systems, we developed feedback equalization based digital logic that changes the threshold of its gates based on the input pattern. We showed that feedback equalization in static complementary CMOS logic enabled up to 20% reduction in energy dissipation while maintaining the performance metrics. We also achieved 30% reduction in energy dissipation for pass-transistor digital logic (PTL) with equalization while maintaining performance. In addition, we proposed a mechanism that leverages feedback equalization techniques to achieve near optimal operation of static complementary CMOS logic blocks over the entire voltage range from near threshold supply voltage to nominal supply voltage. Using energy-delay product (EDP) as a metric we analyzed the use of the feedback equalizer as part of various sequential computational blocks. Our analysis shows that for near-threshold voltage operation, when equalization was used, we can improve the operating frequency by up to 30%, while the energy increase was less than 15%, with an overall EDP reduction of โ‰ˆ10%. We also observe an EDP reduction of close to 5% across entire above-threshold voltage range. On the distributed adaptive algorithm front, we explored energy-efficient hardware implementation of machine learning algorithms. We proposed an adaptive classifier that leverages the wide variability in data complexity to enable energy-efficient data classification operations for mobile systems. Our approach takes advantage of varying classification hardness across data to dynamically allocate resources and improve energy efficiency. On average, our adaptive classifier is โ‰ˆ100ร— more energy efficient but has โ‰ˆ1% higher error rate than a complex radial basis function classifier and is โ‰ˆ10ร— less energy efficient but has โ‰ˆ40% lower error rate than a simple linear classifier across a wide range of classification data sets. We also developed a field of groves (FoG) implementation of random forests (RF) that achieves an accuracy comparable to Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) under tight energy budgets. The FoG architecture takes advantage of the fact that in random forests a small portion of the weak classifiers (decision trees) might be sufficient to achieve high statistical performance. By dividing the random forest into smaller forests (Groves), and conditionally executing the rest of the forest, FoG is able to achieve much higher energy efficiency levels for comparable error rates. We also take advantage of the distributed nature of the FoG to achieve high level of parallelism. Our evaluation shows that at maximum achievable accuracies FoG consumes โ‰ˆ1.48ร—, โ‰ˆ24ร—, โ‰ˆ2.5ร—, and โ‰ˆ34.7ร— lower energy per classification compared to conventional RF, SVM-RBF , Multi-Layer Perceptron Network (MLP), and CNN, respectively. FoG is 6.5ร— less energy efficient than SVM-LR, but achieves 18% higher accuracy on average across all considered datasets
    corecore