135 research outputs found

    ์ด์ข… ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์„ ์œ„ํ•œ ํ™•์žฅํ˜• ์ปดํ“จํ„ฐ ์‹œ์Šคํ…œ ์„ค๊ณ„

    Get PDF
    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: ๊น€์žฅ์šฐ.
    Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP)-based neural network models (e.g., Memory networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only the proposal-specific basic operations and/or customize them for specific models only, which leads to low performance improvement and narrow model coverage. Therefore, an ideal NLP accelerator should first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and should then adaptively optimize its architecture to achieve the best performance for a given model. To address these scalability and model/config diversity issues, the dissertation introduces two novel projects (i.e., MnnFast and NLP-Fast) to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (i.e., high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by the model/config diversity of emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (i.e., CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform.
    As natural language processing has grown in importance, many companies and research groups have been proposing diverse and complex NLP models. These models are becoming more complex in structure, larger in scale, and more varied in kind. To address the complexity, scalability, and diversity of NLP models, this dissertation presents the following key ideas: (1) a static/dynamic analysis that identifies how the performance overheads of diverse NLP models are distributed; (2) a holistic model-parallelization technique that optimizes the memory usage of the main performance bottlenecks revealed by that analysis; (3) techniques that reduce the amount of computation in several operations, together with a dynamic scheduler that resolves the skewness caused by that reduction; and (4) a technique that derives a design optimized for each model to cope with the performance diversity of current NLP models. Because these key techniques apply generically to many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they are highly effective and broadly applicable to computer system design for NLP models. This dissertation applies the techniques on CPU, GPU, and FPGA platforms and shows that all of them achieve meaningful performance improvements in each environment.
    Table of Contents:
    1 INTRODUCTION
    2 Background
    2.1 Memory Networks
    2.2 Deep Learning for NLP
    3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks
    3.1 Motivation & Design Goals
    3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements
    3.1.2 Performance Problems in MemNN - High Computation
    3.1.3 Performance Problems in MemNN - Shared Cache Contention
    3.1.4 Design Goals
    3.2 MnnFast
    3.2.1 Column-Based Algorithm
    3.2.2 Zero Skipping
    3.2.3 Embedding Cache
    3.3 Implementation
    3.3.1 General-Purpose Architecture - CPU
    3.3.2 General-Purpose Architecture - GPU
    3.3.3 Custom Hardware (FPGA)
    3.4 Evaluation
    3.4.1 Experimental Setup
    3.4.2 CPU
    3.4.3 GPU
    3.4.4 FPGA
    3.4.5 Comparison Between CPU and FPGA
    3.5 Conclusion
    4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models
    4.1 Motivation & Design Goals
    4.1.1 High Model Complexity
    4.1.2 High Memory Bandwidth
    4.1.3 Heavy Computation
    4.1.4 Huge Performance Variation
    4.1.5 Design Goals
    4.2 NLP-Fast
    4.2.1 Bottleneck Analysis of NLP Models
    4.2.2 Holistic Model Partitioning
    4.2.3 Cross-operation Zero Skipping
    4.2.4 Adaptive Hardware Reconfiguration
    4.3 NLP-Fast Toolkit
    4.4 Implementation
    4.4.1 General-Purpose Architecture - CPU
    4.4.2 General-Purpose Architecture - GPU
    4.4.3 Custom Hardware (FPGA)
    4.5 Evaluation
    4.5.1 Experimental Setup
    4.5.2 CPU
    4.5.3 GPU
    4.5.4 FPGA
    4.6 Conclusion
    5 Related Work
    5.1 Various DNN Accelerators
    5.2 Various NLP Accelerators
    5.3 Model Partitioning
    5.4 Approximation
    5.5 Improving Flexibility
    5.6 Resource Optimization
    6 Conclusion
    Abstract (In Korean)

    DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

    Full text link
    Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
    Comment: Extension of HOTCHIPS 2022 and accepted in MICRO 202
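To make the summarization/generation distinction above concrete, here is a minimal sketch of the two-stage text-generation loop. TinyLM, prefill, and decode_step are hypothetical placeholders, not DFX's custom instructions or the actual GPT-2 interface.

import numpy as np

class TinyLM:
    # Hypothetical stand-in for a GPT-2-like model with a toy "KV cache".
    def __init__(self, vocab=1000):
        self.vocab = vocab

    def prefill(self, context_ids):
        # Summarization stage: the whole input context is consumed at once.
        return list(context_ids)

    def decode_step(self, token, kv_cache):
        # Generation stage: produce logits for the next token only.
        kv_cache.append(token)
        rng = np.random.default_rng(token)      # deterministic toy logits
        return rng.random(self.vocab), kv_cache

def generate(model, context_ids, num_new_tokens):
    kv_cache = model.prefill(context_ids)       # parallel-friendly stage
    token = context_ids[-1]
    output = []
    for _ in range(num_new_tokens):             # inherently sequential stage
        logits, kv_cache = model.decode_step(token, kv_cache)
        token = int(np.argmax(logits))
        output.append(token)
    return output

print(generate(TinyLM(), [11, 42, 7], num_new_tokens=5))

The prefill call sees the whole context and parallelizes well, whereas each decode_step depends on the token produced by the previous one; that serial loop is the latency bottleneck DFX targets.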

    Energy-Efficient Recurrent Neural Network Accelerators for Real-Time Inference

    Full text link
    Over the past decade, Deep Learning (DL) and Deep Neural Networks (DNN) have gone through rapid development. They are now widely applied to various applications and have profoundly changed the lives of human beings. As an essential element of DNNs, Recurrent Neural Networks (RNN) are helpful in processing time-sequential data and are widely used in applications such as speech recognition and machine translation. RNNs are difficult to compute because of their massive arithmetic operations and large memory footprint. RNN inference workloads used to be executed on conventional general-purpose processors, including Central Processing Units (CPU) and Graphics Processing Units (GPU); however, these contain hardware blocks that are unnecessary for RNN computation, such as the branch predictor and the caching system, making them suboptimal for RNN processing. To accelerate RNN computations and outperform conventional processors, previous work focused on optimization methods on both the software and hardware sides. On the software side, previous works mainly used model compression to reduce the memory footprint and the arithmetic operations of RNNs. On the hardware side, previous works designed domain-specific hardware accelerators based on Field Programmable Gate Arrays (FPGA) or Application Specific Integrated Circuits (ASIC) with customized hardware pipelines optimized for efficient processing of RNNs. By following this software-hardware co-design strategy, previous works achieved at least 10X speedup over conventional processors. Many previous works focused on achieving high throughput with a large batch of input streams. However, in real-time applications, such as gaming Artificial Intelligence (AI) and dynamical system control, low latency is more critical. Moreover, there is a trend of offloading neural network workloads to edge devices to provide a better user experience and privacy protection. Edge devices, such as mobile phones and wearable devices, are usually resource-constrained with a tight power budget. They require RNN hardware that is more energy-efficient to realize both low-latency inference and long battery life. Brain neurons exhibit sparsity in both the spatial domain and the time domain. Inspired by this biological property, previous work mainly explored model compression to induce spatial sparsity in RNNs. The delta network algorithm, in contrast, induces temporal sparsity in RNNs and has been shown by previous works to save over 10X arithmetic operations. In this work, we have proposed customized hardware accelerators that exploit temporal sparsity in Gated Recurrent Unit (GRU)-RNNs and Long Short-Term Memory (LSTM)-RNNs to achieve energy-efficient real-time RNN inference. First, we have proposed DeltaRNN, the first-ever RNN accelerator to exploit temporal sparsity in GRU-RNNs. DeltaRNN has achieved 1.2 TOp/s effective throughput with a batch size of 1, which is 15X higher than its related works. Second, we have designed EdgeDRNN to accelerate GRU-RNN edge inference. Compared to DeltaRNN, EdgeDRNN does not rely on on-chip memory to store RNN weights and focuses on reducing off-chip Dynamic Random Access Memory (DRAM) data traffic using a more scalable architecture. EdgeDRNN has realized real-time inference of large GRU-RNNs with submillisecond latency and only 2.3 W wall-plug power consumption, achieving 4X higher energy efficiency than commercial edge AI platforms like the NVIDIA Jetson Nano.
    Third, we have used DeltaRNN to realize the first-ever continuous speech recognition system with the Dynamic Audio Sensor (DAS) as the front-end. The DAS is a neuromorphic event-driven sensor that produces a stream of asynchronous events instead of audio data sampled at a fixed sample rate. We have also showcased how an RNN accelerator can be integrated with an event-driven sensor on the same chip to realize ultra-low-power Keyword Spotting (KWS) on the extreme edge. Fourth, we have used EdgeDRNN to control a powered robotic prosthesis, using an RNN controller to replace a conventional proportional-derivative (PD) controller. EdgeDRNN has achieved 21 ฮผs latency for running the RNN controller and could maintain stable control of the prosthesis. We have applied DeltaRNN and EdgeDRNN to these problems to prove their value in solving real-world problems. Finally, we have applied the delta network algorithm to LSTM-RNNs and have combined it with a customized structured pruning method, called Column-Balanced Targeted Dropout (CBTD), to induce spatio-temporal sparsity in LSTM-RNNs. Then, we have proposed another FPGA-based accelerator called Spartus, the first RNN accelerator that exploits spatio-temporal sparsity. Spartus achieved 9.4 TOp/s effective throughput with a batch size of 1, the highest among present FPGA-based RNN accelerators with a power budget of around 10 W. Spartus can complete the inference of an LSTM layer having 5 million parameters within 1 ฮผs.
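A minimal sketch of the temporal-sparsity idea behind the delta network algorithm mentioned above: an element only contributes an update when it has changed by more than a threshold since the last propagated value. The threshold, shapes, and function name are illustrative assumptions, not DeltaRNN's or Spartus's implementation.

import numpy as np

def delta_matvec(W, x, x_ref, y_prev, theta=0.1):
    # Delta-network style update: instead of recomputing W @ x from scratch,
    # update the previous result using only the columns whose input changed
    # by more than theta since the last propagated reference value.
    delta = x - x_ref
    active = np.abs(delta) > theta                # columns worth updating
    y = y_prev + W[:, active] @ delta[active]
    # Inactive elements keep their old reference so errors stay bounded.
    x_ref_new = np.where(active, x, x_ref)
    return y, x_ref_new

W = np.random.randn(256, 256)
x0 = np.random.randn(256)
y0 = W @ x0                                       # one full product up front
x1 = x0 + 0.01 * np.random.randn(256)             # slowly changing input
y1, x_ref = delta_matvec(W, x1, x0, y0)

When inputs and hidden states change slowly between timesteps, only a small fraction of columns is active, which is where the over-10X reduction in arithmetic reported above comes from.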

    MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems

    Full text link
    Along with the fast evolution of deep neural networks, hardware systems are also developing rapidly. As a promising solution that achieves high scalability and low manufacturing cost, multi-accelerator systems widely exist in data centers, cloud platforms, and SoCs. Thus, a challenging problem arises in multi-accelerator systems: selecting a proper combination of accelerators from the available designs and searching for efficient DNN mapping strategies. To this end, we propose MARS, a novel mapping framework that can perform computation-aware accelerator selection and apply communication-aware sharding strategies to maximize parallelism. Experimental results show that MARS achieves 32.2% latency reduction on average for typical DNN workloads compared to the baseline, and 59.4% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method.
    Comment: Accepted by 60th DA
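The computation-aware accelerator selection described above can be pictured as a search over candidate designs driven by a latency cost model. The sketch below is a brute-force placeholder under made-up cost models, not MARS's actual framework, and it ignores the communication costs that MARS's sharding strategies account for.

from itertools import combinations

def pick_accelerators(layers, candidates, num_slots):
    # Try every combination of candidate accelerators of size num_slots and
    # keep the one minimizing modeled total latency, assuming each layer runs
    # on whichever selected accelerator is fastest for it.
    best, best_cost = None, float("inf")
    for combo in combinations(candidates, num_slots):
        cost = sum(min(acc["latency"](layer) for acc in combo) for layer in layers)
        if cost < best_cost:
            best, best_cost = combo, cost
    return best, best_cost

layers = [{"flops": 1e9, "bytes": 4e6}, {"flops": 2e8, "bytes": 8e7}]
candidates = [
    {"name": "compute-heavy",   "latency": lambda l: l["flops"] / 8e12 + l["bytes"] / 1e11},
    {"name": "bandwidth-heavy", "latency": lambda l: l["flops"] / 2e12 + l["bytes"] / 4e11},
    {"name": "balanced",        "latency": lambda l: l["flops"] / 4e12 + l["bytes"] / 2e11},
]
print(pick_accelerators(layers, candidates, num_slots=2))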

    EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge

    Full text link
    Despite the remarkable strides of Large Language Models (LLMs) in various fields, the wide application of LLMs on edge devices is limited by their massive parameters and computations. To address this, quantization is commonly adopted to generate lightweight LLMs with efficient computations and fast inference. However, Post-Training Quantization (PTQ) methods dramatically degrade in quality when quantizing weights, activations, and the KV cache together to below 8 bits. Besides, many Quantization-Aware Training (QAT) works quantize model weights while leaving the activations untouched, which does not fully exploit the potential of quantization for inference acceleration on the edge. In this paper, we propose EdgeQAT, the Entropy and Distribution Guided QAT for the optimization of lightweight LLMs to achieve inference acceleration on edge devices. We first identify that the performance drop of quantization primarily stems from the information distortion in quantized attention maps, demonstrated by the different distributions of the quantized query and key in the self-attention mechanism. Then, the entropy and distribution guided QAT is proposed to mitigate the information distortion. Moreover, we design a token importance-aware adaptive method to dynamically quantize tokens with different bit widths for further optimization and acceleration. Our extensive experiments verify the substantial improvements with our framework across various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x compared with its FP16 counterparts across multiple edge devices, signaling a groundbreaking advancement.
    Comment: Preprint
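For readers unfamiliar with QAT, the building block is fake quantization: tensors are quantized and immediately dequantized in the forward pass so that training sees the quantization error. The sketch below shows a plain symmetric per-tensor version with an illustrative bit width; it is not EdgeQAT's entropy- and distribution-guided scheme.

import numpy as np

def fake_quantize(x, num_bits=4):
    # Quantize-dequantize a tensor symmetrically around zero so the rest of
    # the network keeps operating on floats during training. Only the forward
    # transform is shown here.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)   # integer grid
    return q * scale                                     # back to float

w = np.random.randn(128, 128).astype(np.float32)
w_q = fake_quantize(w, num_bits=4)
print("mean quantization error:", np.abs(w - w_q).mean())

In a full QAT loop the rounding is bypassed in the backward pass (straight-through estimator) so gradients still flow to the underlying float weights.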

    HMC-Based Accelerator Design For Compressed Deep Neural Networks

    Get PDF
    Deep Neural Networks (DNNs) offer remarkable performance in classification and regression for many high-dimensional problems and have been widely utilized in real-world cognitive applications. In DNN applications, the high computational cost of DNNs greatly hinders their deployment in resource-constrained applications, real-time systems, and edge computing platforms. Moreover, the energy consumption and performance cost of moving data between the memory hierarchy and the computational units are higher than those of the computation itself. To overcome the memory bottleneck, data locality and temporal data reuse are improved in accelerator design. In an attempt to further improve data locality, memory manufacturers have invented 3D-stacked memory, where multiple layers of memory arrays are stacked on top of each other. Inherited from the concept of Process-In-Memory (PIM), some 3D-stacked memory architectures also include a logic layer that can integrate general-purpose computational logic directly within main memory to take advantage of the high internal bandwidth during computation. In this dissertation, we investigate hardware/software co-design for neural network accelerators. Specifically, we introduce a two-phase filter pruning framework for model compression and an accelerator tailored for efficient DNN execution on HMC, which can dynamically offload primitives and functions to the PIM logic layer through a latency-aware scheduling controller. In our compression framework, we formulate the filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as the conditional entropy over filters in a convolutional layer, i.e., the distribution of entropy conditioned on network loss. Based on this definition, the different pruning efficiencies of global and layer-wise pruning strategies are compared, and a two-phase pruning method is proposed. The proposed pruning method achieves a reduction of 88% of the filters and a 46% inference time reduction on VGG16 within 2% accuracy degradation.
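As a rough illustration of entropy-based filter selection in the spirit of the criterion above, the sketch below scores each filter by the entropy of its activation histogram and prunes the lowest-scoring fraction. The dissertation's criterion is conditional entropy given the network loss; this simplified placeholder does not reproduce it, and all names and parameters are illustrative.

import numpy as np

def filter_entropy_scores(activations, num_bins=32):
    # activations: (num_samples, num_filters), e.g., globally average-pooled
    # outputs of one convolutional layer. Near-constant (low-entropy) filters
    # carry little information and are candidates for pruning.
    scores = []
    for f in range(activations.shape[1]):
        hist, _ = np.histogram(activations[:, f], bins=num_bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        scores.append(-(p * np.log2(p)).sum())
    return np.array(scores)

def select_filters_to_prune(activations, prune_ratio=0.5):
    scores = filter_entropy_scores(activations)
    k = int(len(scores) * prune_ratio)
    return np.argsort(scores)[:k]          # indices of filters to remove

acts = np.random.rand(1024, 64) * np.linspace(0.1, 1.0, 64)  # toy layer outputs
print(select_filters_to_prune(acts, prune_ratio=0.25))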

    Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications

    Full text link
    The challenging deployment of compute-intensive applications from domains such as Artificial Intelligence (AI) and Digital Signal Processing (DSP) forces the computing systems community to explore new design approaches. Approximate Computing appears as an emerging solution, allowing designers to tune the quality of results in order to improve energy efficiency and/or performance. This radical paradigm shift has attracted interest from both academia and industry, resulting in significant research on approximation techniques and methodologies at different design layers (from system down to integrated circuits). Motivated by the wide appeal of Approximate Computing over the last 10 years, we conduct a two-part survey to cover key aspects (e.g., terminology and applications) and review the state-of-the-art approximation techniques from all layers of the traditional computing stack. In Part II of our survey, we classify and present the technical details of application-specific and architectural approximation techniques, which both target the design of resource-efficient processors/accelerators & systems. Moreover, we present a detailed analysis of the application spectrum of Approximate Computing and discuss open challenges and future directions.
    Comment: Under Review at ACM Computing Survey
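As a concrete taste of the quality-versus-efficiency trade-off the survey studies, here is a loop-perforation-style sketch, a classic software approximation technique rather than anything specific to this survey: skipping a fraction of loop iterations cuts the work roughly by the stride at the cost of a bounded error.

import numpy as np

def mean_exact(x):
    return x.sum() / len(x)

def mean_perforated(x, stride=4):
    # Loop perforation: process only every `stride`-th element and rescale,
    # trading accuracy for a roughly stride-fold reduction in work.
    sample = x[::stride]
    return sample.sum() / len(sample)

x = np.random.rand(1_000_000)
print(mean_exact(x), mean_perforated(x, stride=8))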