Scalable Computer System Design for Heterogeneous Natural Language Processing Models
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, College of Engineering, February 2021. Advisor: Jangwoo Kim.
Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP) models (e.g., Memory Networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only proposal-specific basic operations and/or customize them for specific models only, which leads to low performance improvements and narrow model coverage. An ideal NLP accelerator should therefore identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and should adaptively optimize its architecture to achieve the best performance for a given model.
To address these scalability and model/config-diversity issues, the dissertation introduces two novel projects (i.e., MnnFast and NLP-Fast) to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (i.e., high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by the model/config diversity of emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (i.e., CPU, GPU, FPGA) and thoroughly evaluate their performance improvements on each platform.
As natural language processing has grown in importance, companies and research groups have been proposing diverse and complex kinds of NLP models. These models are becoming more complex in structure, larger in scale, and more varied in kind. This dissertation proposes several key ideas to address the complexity, scalability, and diversity of NLP models. The key ideas are as follows: (1) perform static/dynamic analyses to identify the performance-overhead distribution of diverse NLP models; (2) propose a holistic model-parallelization technique that optimizes the memory usage of the major performance bottlenecks identified by the analyses; (3) propose techniques that reduce the computation of various operations, together with a dynamic scheduler that resolves the skewness caused by the computation reduction; and (4) propose a technique that derives a design optimized for each model, addressing the performance diversity of current NLP models. These key techniques are highly effective because they can be applied generically across many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), so they can be broadly adopted in computer-system design for NLP models. This dissertation applies the techniques in CPU, GPU, and FPGA environments and shows that all of them achieve meaningful performance improvements.
1 Introduction
2 Background
2.1 Memory Networks
2.2 Deep Learning for NLP
3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks
3.1 Motivation & Design Goals
3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements
3.1.2 Performance Problems in MemNN - High Computation
3.1.3 Performance Problems in MemNN - Shared Cache Contention
3.1.4 Design Goals
3.2 MnnFast
3.2.1 Column-Based Algorithm
3.2.2 Zero Skipping
3.2.3 Embedding Cache
3.3 Implementation
3.3.1 General-Purpose Architecture - CPU
3.3.2 General-Purpose Architecture - GPU
3.3.3 Custom Hardware (FPGA)
3.4 Evaluation
3.4.1 Experimental Setup
3.4.2 CPU
3.4.3 GPU
3.4.4 FPGA
3.4.5 Comparison Between CPU and FPGA
3.5 Conclusion
4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models
4.1 Motivation & Design Goals
4.1.1 High Model Complexity
4.1.2 High Memory Bandwidth
4.1.3 Heavy Computation
4.1.4 Huge Performance Variation
4.1.5 Design Goals
4.2 NLP-Fast
4.2.1 Bottleneck Analysis of NLP Models
4.2.2 Holistic Model Partitioning
4.2.3 Cross-operation Zero Skipping
4.2.4 Adaptive Hardware Reconfiguration
4.3 NLP-Fast Toolkit
4.4 Implementation
4.4.1 General-Purpose Architecture - CPU
4.4.2 General-Purpose Architecture - GPU
4.4.3 Custom Hardware (FPGA)
4.5 Evaluation
4.5.1 Experimental Setup
4.5.2 CPU
4.5.3 GPU
4.5.4 FPGA
4.6 Conclusion
5 Related Work
5.1 Various DNN Accelerators
5.2 Various NLP Accelerators
5.3 Model Partitioning
5.4 Approximation
5.5 Improving Flexibility
5.6 Resource Optimization
6 Conclusion
Abstract (In Korean)
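The zero-skipping technique listed above (MnnFast §3.2.2, and its cross-operation variant in NLP-Fast §4.2.3) builds on the observation that softmax attention over a large memory yields many near-zero weights whose contribution can be dropped. A minimal NumPy sketch of the idea, with illustrative names and a hypothetical `threshold` parameter, not the dissertation's implementation:

```python
import numpy as np

def attention_with_zero_skipping(q, keys, values, threshold=1e-3):
    """Weighted sum over memory slots, skipping near-zero softmax weights.

    Hypothetical sketch: after the softmax, entries whose weight falls
    below `threshold` contribute almost nothing to the output, so their
    share of the weighted-sum work is skipped.
    """
    scores = keys @ q                      # similarity of query to each memory slot
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax weights over all slots
    keep = w >= threshold                  # slots that survive the skipping test
    out = (w[keep, None] * values[keep]).sum(axis=0)
    return out, int(keep.sum())            # result and number of slots computed
```

Raising the threshold skips more of the weighted-sum work at the cost of a small, bounded approximation error, which is the trade-off a zero-skipping accelerator exploits.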
DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation
Transformer is a deep learning language model widely used for natural
language processing (NLP) services in datacenters. Among transformer models,
Generative Pre-trained Transformer (GPT) has achieved remarkable performance in
text generation, or natural language generation (NLG), which needs the
processing of a large input context in the summarization stage, followed by the
generation stage that produces a single word at a time. Conventional
platforms such as GPUs are specialized for the parallel processing of large
inputs in the summarization stage, but their performance degrades significantly
in the generation stage because of its sequential nature. An
efficient hardware platform is therefore required to address the high latency
caused by the sequential nature of text generation.
In this paper, we present DFX, a multi-FPGA acceleration appliance that
executes GPT-2 model inference end-to-end with low latency and high throughput
in both summarization and generation stages. DFX uses model parallelism and
optimized dataflow that is model-and-hardware-aware for fast simultaneous
workload execution among devices. Its compute cores operate on custom
instructions and provide GPT-2 operations end-to-end. We implement the proposed
hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the
channels of the high bandwidth memory (HBM) and the maximum number of compute
resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x
energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is
also 8.21x more cost-effective than the GPU appliance, suggesting that it is a
promising solution for text generation workloads in cloud datacenters.
Comment: Extension of Hot Chips 2022; accepted at MICRO 2022.
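The sequential bottleneck described above comes from the generation stage's data dependence: each new token requires a full model pass over the sequence produced so far. A toy greedy-decoding loop makes this explicit; `model` is a hypothetical scoring callable used for illustration, not the DFX or GPT-2 API:

```python
def generate(model, prompt_ids, n_new_tokens):
    """Toy view of the two text-generation stages.

    `model` maps a token sequence to a list of next-token scores
    (one score per vocabulary entry).
    """
    # Summarization stage: the whole input context is processed in one
    # pass, which parallelizes well on GPU-like hardware.
    ids = list(prompt_ids)
    scores = model(ids)

    # Generation stage: one token per step, each step depending on the
    # previous one -- the sequential bottleneck DFX targets.
    for _ in range(n_new_tokens):
        next_id = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        ids.append(next_id)
        scores = model(ids)  # full pass over the grown sequence
    return ids
```

The loop body cannot be batched across steps, which is why throughput-oriented platforms lose efficiency in this stage.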
Energy-Efficient Recurrent Neural Network Accelerators for Real-Time Inference
Over the past decade, Deep Learning (DL) and Deep Neural Networks (DNN) have gone through rapid development. They are now applied to a wide range of applications and have profoundly changed the life of human beings. As an essential class of DNNs, Recurrent Neural Networks (RNN) are helpful in processing time-sequential data and are widely used in applications such as speech recognition and machine translation. RNNs are difficult to compute because of their massive arithmetic operations and large memory footprint. RNN inference workloads used to be executed on conventional general-purpose processors, including Central Processing Units (CPU) and Graphics Processing Units (GPU); however, these contain hardware blocks unnecessary for RNN computation, such as branch predictors and caching systems, making them suboptimal for RNN processing. To accelerate RNN computations beyond the performance of conventional processors, previous work focused on optimization methods on both the software and hardware sides. On the software side, previous works mainly used model compression to reduce the memory footprint and the arithmetic operations of RNNs. On the hardware side, previous works designed domain-specific hardware accelerators based on Field Programmable Gate Arrays (FPGA) or Application Specific Integrated Circuits (ASIC), with customized hardware pipelines optimized for efficient processing of RNNs. By following this software-hardware co-design strategy, previous works achieved at least 10X speedup over conventional processors. Many previous works focused on achieving high throughput with a large batch of input streams. However, in real-time applications such as gaming Artificial Intelligence (AI) and dynamical system control, low latency is more critical. Moreover, there is a trend of offloading neural network workloads to edge devices to provide a better user experience and privacy protection.
Edge devices, such as mobile phones and wearable devices, are usually resource-constrained with a tight power budget. They require RNN hardware that is more energy-efficient, to realize both low-latency inference and long battery life. Brain neurons exhibit sparsity in both the spatial and temporal domains. Inspired by this, previous work mainly explored model compression to induce spatial sparsity in RNNs. The delta network algorithm instead induces temporal sparsity in RNNs and, as proven by previous works, can save over 10X arithmetic operations.
In this work, we have proposed customized hardware accelerators that exploit temporal sparsity in Gated Recurrent Unit (GRU)-RNNs and Long Short-Term Memory (LSTM)-RNNs to achieve energy-efficient real-time RNN inference. First, we have proposed DeltaRNN, the first RNN accelerator to exploit temporal sparsity in GRU-RNNs. DeltaRNN achieved 1.2 TOp/s effective throughput with a batch size of 1, which is 15X higher than related works. Second, we have designed EdgeDRNN to accelerate GRU-RNN edge inference. Compared to DeltaRNN, EdgeDRNN does not rely on on-chip memory to store RNN weights and focuses on reducing off-chip Dynamic Random Access Memory (DRAM) data traffic using a more scalable architecture. EdgeDRNN has realized real-time inference of large GRU-RNNs with submillisecond latency and only 2.3 W wall-plug power consumption, achieving 4X higher energy efficiency than commercial edge AI platforms like the NVIDIA Jetson Nano. Third, we have used DeltaRNN to realize the first continuous speech recognition system with the Dynamic Audio Sensor (DAS) as the front-end. The DAS is a neuromorphic event-driven sensor that produces a stream of asynchronous events instead of audio data sampled at a fixed sample rate. We have also showcased how an RNN accelerator can be integrated with an event-driven sensor on the same chip to realize ultra-low-power Keyword Spotting (KWS) on the extreme edge. Fourth, we have used EdgeDRNN to control a powered robotic prosthesis, replacing a conventional proportional-derivative (PD) controller with an RNN controller. EdgeDRNN achieved 21 μs latency for running the RNN controller and maintained stable control of the prosthesis. These applications demonstrate the value of DeltaRNN and EdgeDRNN in solving real-world problems.
Finally, we have applied the delta network algorithm to LSTM-RNNs and combined it with a customized structured pruning method, called Column-Balanced Targeted Dropout (CBTD), to induce spatio-temporal sparsity in LSTM-RNNs. We have then proposed another FPGA-based accelerator called Spartus, the first RNN accelerator to exploit spatio-temporal sparsity. Spartus achieved 9.4 TOp/s effective throughput with a batch size of 1, the highest among present FPGA-based RNN accelerators with a power budget around 10 W. Spartus can complete the inference of an LSTM layer having 5 million parameters within 1 μs.
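The delta network principle underlying DeltaRNN, EdgeDRNN, and Spartus is that a matrix-vector product only needs to be recomputed for inputs that changed by more than a threshold since the previous timestep. A NumPy sketch of one delta update with illustrative names; the real accelerators apply this inside the GRU/LSTM gate computations:

```python
import numpy as np

def delta_matvec(W, x, x_prev, y_prev, theta=0.1):
    """One delta-network update exploiting temporal sparsity.

    Only columns of W whose input changed by more than `theta` are
    touched:  y_t = y_{t-1} + W[:, active] @ delta[active].
    """
    delta = x - x_prev
    active = np.abs(delta) > theta          # inputs that must be recomputed
    y = y_prev + W[:, active] @ delta[active]
    # Carry the reference state forward only for propagated inputs;
    # small changes accumulate until they cross the threshold.
    x_ref = np.where(active, x, x_prev)
    return y, x_ref, int(active.sum())
```

When inputs vary slowly over time (speech features, control signals), most columns stay inactive, which is where the reported >10X operation savings come from.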
MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems
Along with the fast evolution of deep neural networks, hardware systems are
also developing rapidly. As a promising solution achieving high scalability and
low manufacturing cost, multi-accelerator systems widely exist in data centers,
cloud platforms, and SoCs. Thus, a challenging problem arises in
multi-accelerator systems: selecting a proper combination of accelerators from
available designs and searching for efficient DNN mapping strategies. To this
end, we propose MARS, a novel mapping framework that can perform
computation-aware accelerator selection, and apply communication-aware sharding
strategies to maximize parallelism. Experimental results show that MARS can
achieve 32.2% latency reduction on average for typical DNN workloads compared
to the baseline, and 59.4% latency reduction on heterogeneous models compared
to the corresponding state-of-the-art method.
Comment: Accepted by the 60th DAC.
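The selection problem described above can be illustrated with a toy cost model: enumerate accelerator combinations and score each by an estimate of sharded compute time plus communication time. This is only a sketch of the search space, not the actual MARS framework; all names and cost formulas are assumptions:

```python
from itertools import combinations

def select_accelerators(flops, comm_bytes, accels, k):
    """Toy computation- and communication-aware accelerator selection.

    `accels` maps a name to (compute throughput, link bandwidth);
    `flops` of work is sharded evenly across `k` chosen accelerators.
    Returns the best (combination, estimated latency) pair.
    """
    best = None
    for combo in combinations(accels, k):
        # Even sharding: the slowest shard bounds compute time.
        compute = max(flops / k / accels[a][0] for a in combo)
        # Synchronization cost grows with k and the slowest link.
        comm = comm_bytes * (k - 1) / min(accels[a][1] for a in combo)
        latency = compute + comm
        if best is None or latency < best[1]:
            best = (combo, latency)
    return best
```

A real mapping framework would additionally search over sharding dimensions and per-layer strategies rather than a single even split.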
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
Despite the remarkable strides of Large Language Models (LLMs) in various
fields, the wide applications of LLMs on edge devices are limited due to their
massive parameters and computations. To address this, quantization is commonly
adopted to generate lightweight LLMs with efficient computations and fast
inference. However, Post-Training Quantization (PTQ) methods dramatically
degrade in quality when quantizing weights, activations, and KV cache together
to below 8 bits. Besides, many Quantization-Aware Training (QAT) works quantize
model weights while leaving the activations untouched, which does not fully
exploit the potential of quantization for inference acceleration on the edge. In this
paper, we propose EdgeQAT, the Entropy and Distribution Guided QAT for the
optimization of lightweight LLMs to achieve inference acceleration on Edge
devices. We first identify that the performance drop of quantization primarily
stems from the information distortion in quantized attention maps, demonstrated
by the different distributions in quantized query and key of the self-attention
mechanism. Then, the entropy and distribution guided QAT is proposed to
mitigate the information distortion. Moreover, we design a token
importance-aware adaptive method to dynamically quantize the tokens with
different bit widths for further optimization and acceleration. Our extensive
experiments verify the substantial improvements with our framework across
various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x
compared with its FP16 counterparts across multiple edge devices, signaling a
groundbreaking advancement.
Comment: Preprint.
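Two of the ingredients above can be sketched generically: symmetric fake quantization (the standard QAT building block, quantize-then-dequantize) and the row entropy of an attention map, the kind of statistic an entropy-guided objective would monitor for information distortion. This is an illustrative sketch under those assumptions, not the EdgeQAT method:

```python
import numpy as np

def fake_quant(x, bits=4):
    """Symmetric per-tensor fake quantization: quantize to a signed
    integer grid, then dequantize, so training sees quantization error."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.round(x / scale).clip(-qmax, qmax) * scale

def attention_entropy(q, k):
    """Mean Shannon entropy of the attention map's rows, computed from
    scaled dot-product logits between query and key matrices."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())
```

Comparing `attention_entropy(q, k)` before and after `fake_quant` on the query and key gives a concrete view of the information distortion the abstract attributes to low-bit attention maps.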
HMC-Based Accelerator Design For Compressed Deep Neural Networks
Deep Neural Networks (DNNs) offer remarkable performance in classification and regression for many high-dimensional problems and have been widely utilized in real-world cognitive applications. In DNN applications, the high computational cost of DNNs greatly hinders their deployment in resource-constrained applications, real-time systems, and edge computing platforms. Moreover, the energy consumption and performance cost of moving data between the memory hierarchy and computational units are higher than those of the computation itself. To overcome the memory bottleneck, accelerator designs improve data locality and temporal data reuse. In an attempt to further improve data locality, memory manufacturers have invented 3D-stacked memory, where multiple layers of memory arrays are stacked on top of each other. Inheriting the concept of Processing-In-Memory (PIM), some 3D-stacked memory architectures also include a logic layer that can integrate general-purpose computational logic directly within main memory to take advantage of the high internal bandwidth during computation.
In this dissertation, we investigate hardware/software co-design for neural network accelerators. Specifically, we introduce a two-phase filter pruning framework for model compression and an accelerator tailored for efficient DNN execution on the Hybrid Memory Cube (HMC), which can dynamically offload primitives and functions to the PIM logic layer through a latency-aware scheduling controller.
In our compression framework, we formulate the filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our approach is to establish a quantitative connection between filters and model accuracy. We define this connection as the conditional entropy over filters in a convolutional layer, i.e., the distribution of entropy conditioned on the network loss. Based on this definition, the pruning efficiencies of global and layer-wise pruning strategies are compared, and a two-phase pruning method is proposed. The proposed method achieves an 88% reduction in filters and a 46% reduction in inference time on VGG16, within 2% accuracy degradation.
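The conditional-entropy criterion can be illustrated with histogram estimates: discretize a filter's activation statistics and the network loss, compute H(activation | loss) per filter, and keep the top-ranked filters. This is an assumption-laden sketch (bin counts, ranking direction, and all names are illustrative), not the dissertation's exact formulation:

```python
import numpy as np

def conditional_entropy(filter_act, loss, bins=8):
    """Estimate H(filter activation | loss) from binned samples."""
    a = np.digitize(filter_act, np.histogram(filter_act, bins)[1][:-1])
    l = np.digitize(loss, np.histogram(loss, bins)[1][:-1])
    h = 0.0
    for lv in np.unique(l):
        mask = l == lv
        p_l = mask.mean()                       # P(loss bin)
        _, counts = np.unique(a[mask], return_counts=True)
        p = counts / counts.sum()               # P(activation bin | loss bin)
        h += p_l * -(p * np.log(p)).sum()
    return h

def prune_filters(acts, loss, keep_ratio=0.5):
    """Rank filters by the criterion and keep the top fraction.

    acts: (samples, filters) activation summaries; loss: (samples,).
    """
    scores = np.array([conditional_entropy(acts[:, i], loss)
                       for i in range(acts.shape[1])])
    n_keep = max(1, int(keep_ratio * acts.shape[1]))
    return np.argsort(scores)[::-1][:n_keep]    # assumption: keep high-entropy filters
```

A two-phase scheme as described above would apply such a ranking first globally across layers, then layer-wise.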
Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications
The challenging deployment of compute-intensive applications from domains
such as Artificial Intelligence (AI) and Digital Signal Processing (DSP) forces
the computing-systems community to explore new design approaches.
Approximate Computing appears as an emerging solution, allowing designers to tune the
quality of results in the design of a system in order to improve the energy
efficiency and/or performance. This radical paradigm shift has attracted
interest from both academia and industry, resulting in significant research on
approximation techniques and methodologies at different design layers (from
system down to integrated circuits). Motivated by the wide appeal of
Approximate Computing over the last 10 years, we conduct a two-part survey to
cover key aspects (e.g., terminology and applications) and review the
state-of-the-art approximation techniques from all layers of the traditional
computing stack. In Part II of our survey, we classify and present the
technical details of application-specific and architectural approximation
techniques, which both target the design of resource-efficient
processors/accelerators & systems. Moreover, we present a detailed analysis of
the application spectrum of Approximate Computing and discuss open challenges
and future directions.
Comment: Under review at ACM Computing Surveys.