734 research outputs found
Circuit Aging Compensation and Energy-Efficient Neural Network Implementation Using Approximate Computing
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Hyuk-Jae Lee.
Approximate computing reduces the cost (energy and/or latency) of computations by relaxing their correctness (i.e., precision) to a level that depends on the type of application. Moreover, it can be realized at various levels of the computing-system design hierarchy, from the circuit level to the application level.
This dissertation presents methodologies that apply approximate computing across these hierarchies: compensating aging-induced delay in logic circuits through dynamic computation approximation (Chapter 1), designing an energy-efficient neural network by combining low-power and low-latency approximate neuron models (Chapter 2), and co-designing an in-memory gradient-descent module with a neural processing unit to address the memory bottleneck incurred by memory I/O for high-precision data (Chapter 3).
The first chapter of this dissertation presents a novel design methodology that turns aging-induced timing violations into computation approximation errors, without a reliability guardband or an increased supply voltage. This is made possible by accurately monitoring the critical-path delay at run time. The proposal is evaluated at two levels: the RTL component level and the system level. The experimental results at the RTL component level show a significant reduction in the (normalized) mean squared error caused by timing violations; at the system level, they show that the proposed approach successfully transforms aging-induced timing-violation errors into much less harmful computation approximation errors, thereby recovering image quality to perceptually acceptable levels. It reduces dynamic and static power consumption by 21.45% and 10.78%, respectively, with a 0.8% area overhead compared to the conventional approach.
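The core trade described above, exchanging timing slack for a bounded numerical error, can be illustrated with a toy sketch. The LSB-truncation scheme below is hypothetical and much simpler than the dissertation's delay-configurable adder; it only shows why coarser arithmetic settles faster and bounds the resulting error.

```python
def approximate_add(a: int, b: int, truncate_bits: int) -> int:
    """Add two non-negative integers while ignoring the lowest
    `truncate_bits` bits of each operand.

    Zeroing the LSBs shortens the effective carry chain, so the sum
    settles earlier; that timing slack can absorb aging-induced delay
    at the cost of a bounded approximation error.
    """
    mask = ~((1 << truncate_bits) - 1)
    return (a & mask) + (b & mask)

# As the monitored critical-path delay grows, more bits are truncated.
exact = 1000 + 523                       # 1523
approx = approximate_add(1000, 523, 4)   # lowest 4 bits zeroed -> 1504
assert abs(exact - approx) < 2 * (1 << 4)
```

The error is bounded by the truncation granularity, which is what lets the image-quality degradation stay perceptually acceptable.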
The second chapter of this dissertation presents an energy-efficient neural network built from two alternative neuron models: the Stochastic-Computing (SC) and Spiking (SP) neuron models. SC has been adopted in various fields to improve the power efficiency of systems by performing arithmetic computations stochastically, approximating the binary computation of conventional computing systems. Moreover, recent work has shown that deep neural networks (DNNs) can be implemented with stochastic computing, greatly reducing power consumption. However, a stochastic DNN (SC-DNN) suffers from high latency because it processes only one bit per cycle. To address this problem, a spiking DNN (SP-DNN) is adopted as the input interface for the SC-DNN, since SP neurons effectively process more bits per cycle than SC neurons. Moreover, this chapter resolves the encoding mismatch between the two neuron models without hardware cost by compensating for the mismatch through synapse-weight calibration. The resulting hybrid DNN (SPSC-DNN) uses SP-DNN bottom layers and SC-DNN top layers. Exploiting the reduced latency of the SP-DNN and the low power consumption of the SC-DNN, the proposed SPSC-DNN achieves better energy efficiency with a lower error rate than either an SC-DNN or an SP-DNN in the same network configuration.
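The latency/power trade-off above follows directly from how SC encodes values. As a minimal illustration (unipolar encoding only, not the chapter's SPSC design), stochastic multiplication reduces to a single AND gate over long bitstreams, processed one bit per clock cycle:

```python
import random

def to_stream(p: float, length: int, rng: random.Random) -> list:
    """Encode a value p in [0, 1] as a unipolar bitstream: P(bit=1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(x: list, y: list) -> float:
    """A single AND gate multiplies two independent unipolar streams;
    decoding is just counting the ones."""
    return sum(a & b for a, b in zip(x, y)) / len(x)

rng = random.Random(0)
n = 10_000                          # one bit per clock cycle -> high latency
sx = to_stream(0.6, n, rng)
sy = to_stream(0.5, n, rng)
approx = sc_multiply(sx, sy)        # approximates 0.6 * 0.5 = 0.3
assert abs(approx - 0.3) < 0.03     # accuracy grows only with stream length
```

A multiplier shrinks to one gate, but thousands of cycles are needed for a precise result, which is exactly the latency problem the SP-DNN front end mitigates.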
The third chapter of this dissertation proposes the GradPIM architecture, which accelerates parameter updates by processing-in-memory, co-designed with 8-bit floating-point training in a Neural Processing Unit (NPU) for deep neural networks. By keeping the high-precision processing, such as the parameter update that incorporates high-precision weights, in memory, GradPIM can achieve high computational efficiency using 8-bit floating point in the NPU and also gain power efficiency by eliminating massive high-precision data transfers between the NPU and off-chip memory. A simple extension of DDR4 SDRAM utilizing bank-group parallelism makes the operation designs in the processing-in-memory (PIM) module efficient in terms of hardware cost and performance. The experimental results show that the proposed architecture can improve the performance of the parameter-update phase of training by up to 40% and greatly reduce the memory bandwidth requirement while posing only a minimal amount of overhead to the protocol and the DRAM area.
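The division of labor GradPIM exploits can be sketched abstractly. The code below is a toy model, not GradPIM's actual formats or update algorithms: a crude fixed-point rounding stands in for the 8-bit float format, the loss is a toy quadratic, and the point is only that low-precision weights feed the compute unit while the high-precision master copy and the update step stay on the memory side.

```python
import numpy as np

def quantize_fp8_like(x: np.ndarray, frac_bits: int = 4) -> np.ndarray:
    """Crude fixed-point stand-in for an 8-bit floating-point format."""
    scale = 1 << frac_bits
    return np.round(x * scale) / scale

np.random.seed(0)
master = np.random.randn(4)        # high-precision master copy, kept in memory
lr = 0.01
for _ in range(500):
    w_npu = quantize_fp8_like(master)   # only low-precision weights reach the NPU
    grad = 2 * (w_npu - 1.0)            # toy gradient of the loss (w - 1)^2
    master -= lr * grad                 # update applied in-memory, at full precision

# Training converges although the NPU never sees the full-precision weights.
assert np.allclose(master, 1.0, atol=0.05)
```

Because only the 8-bit view ever crosses the NPU-DRAM interface, the high-precision traffic that would otherwise dominate the update phase disappears.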
Chapter I: Dynamic Computation Approximation for Aging Compensation
1.1 Introduction
1.1.1 Chip Reliability
1.1.2 Reliability Guardband
1.1.3 Approximate Computing in Logic Circuits
1.1.4 Computation Approximation for Aging Compensation
1.1.5 Motivational Case Study
1.2 Previous Work
1.2.1 Aging-induced Delay
1.2.2 Delay-Configurable Circuits
1.3 Proposed System
1.3.1 Overview of the Proposed System
1.3.2 Proposed Adder
1.3.3 Proposed Multiplier
1.3.4 Proposed Monitoring Circuit
1.3.5 Aging Compensation Scheme
1.4 Design Methodology
1.5 Evaluation
1.5.1 Experimental Setup
1.5.2 RTL Component Level: Adder/Multiplier
1.5.3 RTL Component Level: Monitoring Circuit
1.5.4 System Level
1.6 Summary
Chapter II: Energy-Efficient Neural Network by Combining Approximate Neuron Models
2.1 Introduction
2.1.1 Deep Neural Network (DNN)
2.1.2 Low-power Designs for DNN
2.1.3 Stochastic-Computing Deep Neural Network
2.1.4 Spiking Deep Neural Network
2.2 Hybrid of Stochastic and Spiking DNNs
2.2.1 Stochastic-Computing vs. Spiking Deep Neural Network
2.2.2 Combining Spiking Layers and Stochastic Layers
2.2.3 Encoding Mismatch
2.3 Evaluation
2.3.1 Latency and Test Error
2.3.2 Energy Efficiency
2.4 Summary
Chapter III: GradPIM: In-memory Gradient Descent in Mixed-Precision DNN Training
3.1 Introduction
3.1.1 Neural Processing Unit
3.1.2 Mixed-precision Training
3.1.3 Mixed-precision Training with In-memory Gradient Descent
3.1.4 DNN Parameter Update Algorithms
3.1.5 Modern DRAM Architecture
3.1.6 Motivation
3.2 Previous Work
3.2.1 Processing-In-Memory
3.2.2 Co-design of Neural Processing Unit and Processing-In-Memory
3.2.3 Low-precision Computation in NPU
3.3 GradPIM
3.3.1 GradPIM Architecture
3.3.2 GradPIM Operations
3.3.3 Timing Considerations
3.3.4 Update Phase Procedure
3.3.5 Commanding GradPIM
3.4 NPU Co-design with GradPIM
3.4.1 NPU Architecture
3.4.2 Data Placement
3.5 Evaluation
3.5.1 Evaluation Methodology
3.5.2 Experimental Results
3.5.3 Sensitivity Analysis
3.5.4 Layer Characterizations
3.5.5 Distributed Data Parallelism
3.6 Summary
3.6.1 Discussion
Bibliography
Abstract (in Korean)
Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques
The rapid growth of demanding applications in domains applying multimedia
processing and machine learning has marked a new era for edge and cloud
computing. These applications involve massive data and compute-intensive tasks,
and thus, typical computing paradigms in embedded systems and data centers are
stressed to meet the worldwide demand for high performance. Concurrently, the
landscape of the semiconductor field over the last 15 years has made power a
first-class design concern. As a result, the computing-systems community is
forced to find alternative design approaches that facilitate high-performance
and/or power-efficient computing. Among the examined solutions, Approximate
Computing has attracted ever-increasing interest, with research works applying
approximations across the entire traditional computing stack, i.e., at the
software, hardware, and architectural levels. Over the last decade, a plethora
of approximation techniques has emerged in software (programs, frameworks,
compilers, runtimes, languages), hardware (circuits, accelerators), and
architectures (processors, memories). The current article is Part I of our
comprehensive survey on Approximate Computing: it reviews its motivation,
terminology, and principles, and classifies and presents the technical details
of state-of-the-art software and hardware approximation techniques.
Comment: Under review at ACM Computing Surveys.
Low Power Processor Architectures and Contemporary Techniques for Power Optimization - A Review
Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This inversely proportional decline in size and boost in performance in turn demands shrinking supply voltages and effective power dissipation in chips with millions of transistors. This has triggered a substantial amount of research into power-reduction techniques for almost every aspect of the chip, particularly the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency, mainly at the processor-core level, but also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, caches, and pipelining, can be optimized to reduce the processor's power consumption, and this paper discusses ways in which to optimize them. Emerging power-efficient processor architectures are also overviewed and research activities discussed, which should help the reader identify how these factors contribute to a processor's power consumption. Some of these concepts are already established, whereas others remain active research areas. © 2009 ACADEMY PUBLISHER
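The leverage of the supply-voltage knob discussed in this review comes from the classic switching-power relation P_dyn = alpha * C_eff * Vdd^2 * f. A quick numeric sketch (illustrative values, not figures from the paper) shows why voltage scaling is the most powerful of the listed parameters:

```python
def dynamic_power(alpha: float, c_eff: float, vdd: float, freq: float) -> float:
    """Switching-power model: P = alpha * C_eff * Vdd^2 * f."""
    return alpha * c_eff * vdd ** 2 * freq

# Lowering Vdd from 1.2 V to 0.9 V (scaling frequency down with it,
# as voltage scaling typically requires) cuts dynamic power by ~58%,
# because Vdd enters the model quadratically.
p_hi = dynamic_power(0.2, 1e-9, 1.2, 2.0e9)   # 0.576 W
p_lo = dynamic_power(0.2, 1e-9, 0.9, 1.5e9)   # 0.243 W
savings = 1 - p_lo / p_hi
assert 0.55 < savings < 0.60
```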
Energy Efficient Hardware Design of Neural Networks
Hardware implementation of deep neural networks is gaining significant importance nowadays. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms, such as multi-layer perceptrons (MLPs), have demonstrated human-level recognition accuracy in image and speech classification tasks. These networks are built from multiple layers of processing elements called neurons, with many connections between them called synapses; they therefore involve highly parallel operations and are computationally and memory intensive. Constrained by computing resources and memory, most applications require a neural network that uses less energy. Energy-efficient implementation of these computationally intense algorithms on neuromorphic hardware demands many architectural optimizations. One such optimization is reducing the network size through compression, and several studies have investigated compression by introducing element-wise or row-/column-/block-wise sparsity via pruning and regularization. Additionally, numerous recent works have concentrated on reducing the precision of activations and weights, some down to a single bit. However, combining various sparsity structures with binarized or very-low-precision (2-3 bit) neural networks has not been comprehensively explored. Output activations in these deep neural network algorithms are typically non-binary, making it difficult to exploit sparsity. On the other hand, biologically realistic models such as spiking neural networks (SNNs) closely mimic the operations of biological nervous systems and explore new avenues for brain-like cognitive computing. These networks deal with binary spikes, and they can exploit input-dependent sparsity or redundancy to dynamically scale the amount of computation, in turn leading to energy-efficient hardware implementations.
This work discusses a configurable spiking neuromorphic architecture that supports multiple hidden layers by exploiting hardware reuse. It also presents design techniques for minimum-area/-energy DNN hardware with minimal degradation in accuracy. Area, performance, and energy results of the DNN and SNN hardware are reported for the MNIST dataset. The neuromorphic hardware designed for the SNN algorithm in 28nm CMOS demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4-773 nJ per classification). The optimized DNN hardware designed in 40nm CMOS, combining 8x structured compression and 3-bit weight precision, showed 98.4% accuracy at 33 nJ per classification.
Dissertation/Thesis: Masters Thesis, Electrical Engineering, 201
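The very-low-precision-weight idea above can be sketched with a uniform symmetric quantizer, a common baseline; the thesis's exact quantization and structured-compression scheme is not reproduced here:

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 3) -> np.ndarray:
    """Uniform symmetric quantization to 2**(bits-1) - 1 signed levels."""
    levels = 2 ** (bits - 1) - 1               # 3 levels per sign for 3-bit
    scale = np.abs(w).max() / levels
    return np.clip(np.round(w / scale), -levels, levels) * scale

np.random.seed(1)
w = np.random.randn(6, 6)                      # toy weight matrix
wq = quantize_weights(w, bits=3)
assert np.unique(wq).size <= 7                 # at most 2*3 + 1 distinct values
assert np.abs(w - wq).max() <= np.abs(w).max() / 6 + 1e-12   # error <= scale/2
```

With only seven distinct weight values, multipliers collapse to narrow lookup/shift logic, which is where the reported area and energy savings originate.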
Design Techniques for Energy-Quality Scalable Digital Systems
Energy efficiency is one of the key design goals in modern computing. Increasingly complex tasks are being executed on mobile devices and Internet of Things end-nodes, which are expected to operate for long time intervals, on the order of months or years, with the limited energy budgets provided by small form-factor batteries. Fortunately, many such tasks are error resilient, meaning that they can tolerate some relaxation in the accuracy, precision or reliability of internal operations, without a significant impact on the overall output quality. The error resilience of an application may derive from a number of factors. The processing of analog sensor inputs measuring quantities from the physical world may not always require maximum precision, as the amount of information that can be extracted is limited by the presence of external noise. Outputs destined for human consumption may also contain small or occasional errors, thanks to the limited capabilities of our vision and hearing systems. Finally, some computational patterns commonly found in domains such as statistics, machine learning and operational research naturally tend to reduce or eliminate errors. Energy-Quality (EQ) scalable digital systems systematically trade off the quality of computations against energy efficiency, by relaxing the precision, the accuracy, or the reliability of internal software and hardware components in exchange for energy reductions. This design paradigm is believed to offer one of the most promising solutions to the pressing need for low-energy computing. Despite these high expectations, the current state of the art in EQ scalable design suffers from important shortcomings. First, the great majority of techniques proposed in the literature focus only on processing hardware and software components. Nonetheless, for many real devices, processing contributes only a small portion of the total energy consumption, which is dominated by other components (e.g.
I/O, memory or data transfers). Second, in order to fulfill its promises and become diffused in commercial devices, EQ scalable design needs to achieve industrial-level maturity. This involves moving from purely academic research based on high-level models and theoretical assumptions to engineered flows compatible with existing industry standards. Third, the time-varying nature of error tolerance, both among different applications and within a single task, should become more central in the proposed design methods. This involves designing "dynamic" systems, in which the precision or reliability of operations (and consequently their energy consumption) can be tuned at runtime, rather than "static" solutions, in which the output quality is fixed at design time. This thesis introduces several new EQ scalable design techniques for digital systems that take the previous observations into account. Besides processing, the proposed methods apply the principles of EQ scalable design also to interconnects and peripherals, which are often relevant contributors to the total energy in sensor nodes and mobile systems, respectively. Regardless of the target component, the presented techniques pay special attention to the accurate evaluation of the benefits and overheads deriving from EQ scalability, using industrial-level models, and to integration with existing standard tools and protocols. Moreover, all the works presented in this thesis allow the dynamic reconfiguration of output quality and energy consumption. More specifically, the contribution of this thesis is divided into three parts. In the first body of work, the design of EQ scalable modules for processing hardware datapaths is considered. Three design flows are presented, targeting different technologies and exploiting different ways to achieve EQ scalability, i.e. timing-induced errors and precision reduction.
These works are inspired by previous approaches from the literature, namely Reduced-Precision Redundancy and Dynamic Accuracy Scaling, which are re-thought to make them compatible with standard Electronic Design Automation (EDA) tools and flows, providing solutions to overcome their main limitations. The second part of the thesis investigates the application of EQ scalable design to serial interconnects, which are the de facto standard for data exchange between processing hardware and sensors. In this context, two novel bus encodings are proposed, called Approximate Differential Encoding and Serial-T0, that exploit the statistical characteristics of data produced by sensors to reduce the energy consumption on the bus at the cost of controlled data approximations. The two techniques achieve different results for data of different origins, but share the common features of allowing runtime reconfiguration of the allowed error and being compatible with standard serial bus protocols. Finally, the last part of the manuscript is devoted to the application of EQ scalable design principles to displays, which are often among the most energy-hungry components in mobile systems. The two proposals in this context leverage the emissive nature of Organic Light-Emitting Diode (OLED) displays to save energy by altering the displayed image, thus inducing an output quality reduction that depends on the amount of such alteration. The first technique implements an image-adaptive form of brightness scaling, whose outputs are optimized for the balance between power consumption and similarity to the input. The second approach achieves concurrent power reduction and image enhancement by means of an adaptive polynomial transformation. Both solutions focus on minimizing the overheads associated with a real-time implementation of the transformations in software or hardware, so that these do not offset the savings in the display.
For each of these three topics, the results show that the aforementioned goal of building EQ scalable systems compatible with existing best practices and mature enough to be integrated in commercial devices can be effectively achieved. Moreover, they also show that very simple and similar principles can be applied to design EQ scalable versions of different system components (processing, peripherals and I/O), and to equip these components with knobs for the runtime reconfiguration of the energy versus quality tradeoff.
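Of the proposals above, the differential bus-encoding idea lends itself to a compact sketch. The code below is an illustrative coarse-difference scheme, not the thesis's actual Approximate Differential Encoding; the `drop_bits` parameter plays the role of the runtime error knob, and the per-step error stays bounded because the encoder tracks the decoder's state.

```python
def approx_diff_encode(samples: list, drop_bits: int) -> list:
    """Send each sample as a coarse difference from the previous one,
    with the lowest `drop_bits` bits dropped. Sensor data changes
    slowly, so coarse differences are small numbers that toggle few
    bus wires; `drop_bits` is the runtime error knob."""
    out, prev = [], 0
    for s in samples:
        d = (s - prev) >> drop_bits << drop_bits   # quantized difference
        prev += d                                  # mirror the decoder's state
        out.append(d)
    return out

readings = [100, 102, 105, 104, 110]
coded = approx_diff_encode(readings, drop_bits=2)   # [100, 0, 4, 0, 4]
decoded, acc = [], 0
for d in coded:                                     # receiver: cumulative sum
    acc += d
    decoded.append(acc)
assert all(abs(a - b) < 4 for a, b in zip(readings, decoded))
```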
Vector support for multicore processors with major emphasis on configurable multiprocessors
It has recently become increasingly difficult to build higher-speed uniprocessor chips because of performance degradation and high power consumption. Quadratically increasing circuit complexity forbade the exploitation of more instruction-level parallelism (ILP). To continue raising performance, processor designers then focused on thread-level parallelism (TLP) to realize a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable of increasing performance and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support, which could yield significant speedups for array operations while maintaining area/power efficiency.
This dissertation proposes and presents the realization of an FPGA-based prototype of a multicore architecture with a shared vector unit (MCwSV). FPGA stands for Field-Programmable Gate Array. The idea is that rather than improving only scalar or TLP performance, some of the hardware budget could be used to realize a vector unit to greatly speed up applications abundant in data-level parallelism (DLP). Realistically, limited by the parallelism in the application itself and by the compiler's vectorizing abilities, most general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach could also keep the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips.
The design, implementation and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented for FPGA prototyping. The MicroBlaze processor, which is a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled and multi-banked structure. This organization provides substantial system scalability and better vector performance. For a given area budget, benchmarks from several areas show that the MCwSV system can provide significant performance increase as compared to a multicore system without a vector unit.
However, an MCwSV system with two MicroBlazes and a shared vector unit is not always an optimized system configuration for various applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability, to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be taken into account to create optimized MCwSV systems for various applications. So the work eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption and execution latency are three metrics to be used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value.
Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need retailoring for the mobile market that they are entering now. The floating-point (FP) fused multiply-add (FMA) unit, being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering the active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing timing. We evaluate the proposed techniques using both synthetic and "real-world" application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming an active VFU operating at peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together using "real-world" benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P.
The work of I. Ratkovic was supported by an FPU research grant from the Spanish MECD. Peer Reviewed. Postprint (author's final draft).
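The vector-masking technique mentioned above admits a simple first-order energy model: lanes whose mask bit is zero are clock-gated and contribute no switching activity. A toy sketch (illustrative numbers, not the paper's measurements):

```python
def gated_lane_cycles(mask: list, cycles_per_op: int) -> int:
    """First-order model of vector-mask-aware clock gating: a lane
    whose mask bit is 0 is clock-gated and burns no dynamic energy."""
    return sum(mask) * cycles_per_op

mask = [1, 0, 0, 1, 0, 0, 0, 1]          # 3 of 8 lanes do useful work
baseline = len(mask) * 4                  # every lane clocked for 4 cycles
gated = gated_lane_cycles(mask, 4)        # only active lanes clocked
assert 1 - gated / baseline == 0.625      # 62.5% of lane-cycles gated off
```

Real savings depend on how much of a lane's power is actually switching power, which is why the paper reports measured reductions rather than this idealized bound.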
Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications
The challenging deployment of compute-intensive applications from domains
such as Artificial Intelligence (AI) and Digital Signal Processing (DSP)
forces the community of computing systems to explore new design approaches.
Approximate Computing appears as an emerging solution, allowing the quality of
results to be tuned in the design of a system in order to improve energy
efficiency and/or performance. This radical paradigm shift has attracted
interest from both academia and industry, resulting in significant research on
approximation techniques and methodologies at different design layers (from
system down to integrated circuits). Motivated by the wide appeal of
Approximate Computing over the last 10 years, we conduct a two-part survey to
cover key aspects (e.g., terminology and applications) and review the
state-of-the-art approximation techniques from all layers of the traditional
computing stack. In Part II of our survey, we classify and present the
technical details of application-specific and architectural approximation
techniques, which both target the design of resource-efficient
processors/accelerators and systems. Moreover, we present a detailed analysis
of the application spectrum of Approximate Computing and discuss open
challenges and future directions.
Comment: Under review at ACM Computing Surveys.