Emerging accelerator platforms for data centers
CPU and GPU platforms may not be the best options for many emerging compute patterns, which has led to a new breed of accelerator platforms. This article gives a comprehensive overview of these platforms, with a focus on commercial offerings.
Efficient Learning in Heterogeneous Internet of Things Ecosystems
The Internet of Things (IoT) is a growing network of heterogeneous devices, combining various sensing and computing nodes at different scales, which generates large volumes of data. Many IoT applications use machine learning (ML) algorithms to analyze these data. The high computational complexity of ML workloads poses significant challenges to IoT computing platforms, which tend to be less powerful, resource-constrained devices. Transmitting such large volumes of data to the cloud also raises issues of scalability, security, and privacy. In this dissertation, we propose efficient solutions that perform ML tasks while decreasing power consumption and improving performance. We first leverage the heterogeneous and interconnected nature of IoT systems, where IoT applications run on many different architectures (e.g., an x86 server or an ARM-based edge device) while communicating with each other. We present a cross-platform power and performance prediction technique for intelligent task allocation. The proposed technique estimates time-variant energy consumption with only 7% error across completely different architectures, enabling intelligent task allocation that saves 16.5% of the energy consumption of state-of-the-art ML workloads. We next show how to advance the learning procedures toward real-time and online processing by distributing learning tasks across the hierarchy of IoT devices. Our solution leverages brain-inspired high-dimensional (HD) computing to derive a new class of learning algorithms that can easily run on IoT devices while providing accuracy comparable to the state of the art. We show that the HD-based learning algorithms can cover various real-world problems, from conventional classification to cognitive tasks beyond classical ML such as DNA pattern matching.
We demonstrate that HD-based learning can enable secure, collaborative learning by efficiently distributing a large volume of learning tasks across heterogeneous computing nodes. We have implemented the proposed learning solution on various platforms with superior computing efficiency. For example, our solution achieves 486× and 7× performance improvements for the training and inference phases, respectively, on a low-power ARM processor, compared to state-of-the-art deep learning.
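The HD-based learning described above can be illustrated with a minimal sketch. The dimensionality, the random bipolar encoding, and all names here are hypothetical choices for illustration, not the dissertation's actual algorithm: features are projected onto random hypervectors, class prototypes are formed by bundling (summing) encoded samples, and classification picks the prototype with the highest cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000          # hypervector dimensionality (illustrative choice)
N_FEATURES = 4

# Each input feature is assigned a random bipolar base hypervector.
base = rng.choice([-1.0, 1.0], size=(N_FEATURES, D))

def encode(x):
    """Encode a feature vector as one hypervector: a signed, weighted
    bundling of the per-feature base vectors (one common HD encoding)."""
    return np.sign(x @ base)

def train(samples, labels, n_classes):
    """Class prototypes are element-wise sums of encoded training samples."""
    protos = np.zeros((n_classes, D))
    for x, y in zip(samples, labels):
        protos[y] += encode(x)
    return protos

def classify(protos, x):
    """Predict the class whose prototype is most cosine-similar to encode(x)."""
    h = encode(x)
    sims = protos @ h / (np.linalg.norm(protos, axis=1) * np.linalg.norm(h) + 1e-9)
    return int(np.argmax(sims))
```

Because training and inference reduce to element-wise additions and dot products over wide vectors, such algorithms map naturally onto constrained, parallel IoT hardware.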
Doctor of Philosophy dissertation
Deep Neural Networks (DNNs) are the state-of-the-art solution in a growing number of tasks, including computer vision, speech recognition, and genomics. However, DNNs are computationally expensive, as they are carefully trained to extract and abstract features from raw data using multiple layers of neurons with millions of parameters. In this dissertation, we primarily focus on inference, e.g., using a DNN to classify an input image. This is an operation that will be repeatedly performed on billions of devices in the datacenter, in self-driving cars, in drones, etc. We observe that DNNs spend the vast majority of their runtime performing matrix-by-vector multiplications (MVMs). MVMs have two major bottlenecks: fetching the matrix and performing sum-of-product operations. To address these bottlenecks, we use in-situ computing, where the matrix is stored in programmable resistor arrays, called crossbars, and sum-of-product operations are performed using analog computing. In this dissertation, we propose two hardware units, ISAAC and Newton. In ISAAC, we show that in-situ computing designs can outperform digital DNN accelerators if they leverage pipelining and smart encodings, and can distribute a computation in time and space, within crossbars and across crossbars. In the ISAAC design, roughly half the chip area/power is attributable to analog-to-digital conversion (ADC), which remains the key design challenge in mixed-signal accelerators for deep networks. In spite of the ADC bottleneck, ISAAC outperforms the computational efficiency of the state-of-the-art design (DaDianNao) by 8x. In Newton, we take advantage of a number of techniques to address ADC inefficiency. These techniques exploit matrix transformations, heterogeneity, and smart mapping of computation to the analog substrate. We show that Newton can increase the efficiency of in-situ computing by an additional 2x.
Finally, we show that in-situ computing unfortunately cannot be easily adapted to handle training of deep networks, i.e., it is only suitable for inference of already-trained networks. By improving the efficiency of DNN inference with ISAAC and Newton, we move closer to low-cost deep learning that will in turn have societal impact through self-driving cars, assistive systems for the disabled, and precision medicine.
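The crossbar idea above can be made concrete with a small numeric sketch, assuming an idealized device (all values, and the ADC model, are illustrative): a layer's weight matrix is stored once as conductances, applying input voltages produces per-column currents equal to the sum-of-products by Kirchhoff's current law, so the matrix itself is never fetched; only the column currents must pass through the costly ADC step.

```python
import numpy as np

def crossbar_mvm(G, v):
    """Idealized crossbar read: driving row voltages v into a crossbar of
    conductances G yields column currents i_j = sum_k v[k] * G[k, j]
    (Kirchhoff's current law). Wire resistance and noise are ignored."""
    return v @ G

def adc(i, bits=8, full_scale=2.0):
    """Model the analog-to-digital conversion that dominates area/power in
    such designs: quantize each column current to 2**bits levels
    (parameters are illustrative, not ISAAC's actual configuration)."""
    levels = 2**bits - 1
    return np.round(np.clip(i / full_scale, 0, 1) * levels) / levels * full_scale

# A layer's weight matrix, programmed into the crossbar once...
W = np.array([[0.2, 0.5],
              [0.1, 0.3],
              [0.4, 0.0]])
x = np.array([1.0, 0.5, 2.0])

# ...so each MVM is a single analog read followed by quantization.
currents = crossbar_mvm(W, x)
digital = adc(currents)
```

Note how the digital result differs from the exact product only by the quantization step, which is why ADC precision and cost dominate the design trade-offs discussed above.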
Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications
With the advent of dedicated Deep Learning (DL) accelerators and neuromorphic
processors, new opportunities are emerging for applying deep and Spiking Neural
Network (SNN) algorithms to healthcare and biomedical applications at the edge.
This can facilitate the advancement of the medical Internet of Things (IoT)
systems and Point of Care (PoC) devices. In this paper, we provide a tutorial
describing how various technologies ranging from emerging memristive devices,
to established Field Programmable Gate Arrays (FPGAs), and mature Complementary
Metal Oxide Semiconductor (CMOS) technology can be used to develop efficient DL
accelerators to solve a wide variety of diagnostic, pattern recognition, and
signal processing problems in healthcare. Furthermore, we explore how spiking
neuromorphic processors can complement their DL counterparts for processing
biomedical signals. After providing the required background, we unify the
sparsely distributed research on neural network and neuromorphic hardware
implementations as applied to the healthcare domain. In addition, we benchmark
various hardware platforms by performing a biomedical electromyography (EMG)
signal processing task and drawing comparisons among them in terms of inference
delay and energy. Finally, we provide our analysis of the field and share a
perspective on the advantages, disadvantages, challenges, and opportunities
that different accelerators and neuromorphic processors introduce to healthcare
and biomedical domains. This paper can serve a large audience, from
nanoelectronics researchers to biomedical and healthcare practitioners, in
grasping the fundamental interplay between hardware, algorithms, and clinical
adoption of these tools, as we shed light on the future of deep networks and
spiking neuromorphic processing systems as drivers of progress in biomedical
circuits and systems.
Comment: Submitted to IEEE Transactions on Biomedical Circuits and Systems (21 pages, 10 figures, 5 tables)
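As a rough illustration of the spiking approach discussed above, the leaky integrate-and-fire (LIF) neuron, the basic unit of many neuromorphic processors, can be simulated in a few lines. All constants here are illustrative and not taken from any platform benchmarked in the paper:

```python
import numpy as np

def lif_spikes(current, dt=1e-3, tau=20e-3, v_th=1.0, v_reset=0.0):
    """Minimal leaky integrate-and-fire neuron: integrate the input current
    with an exponential leak (Euler step) and emit a spike, resetting the
    membrane potential, whenever the threshold v_th is crossed."""
    v, spikes = 0.0, []
    for i in current:
        v += dt / tau * (-v + i)   # leaky integration
        if v >= v_th:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes
```

Because the output is a sparse spike train rather than dense activations, biomedical signals such as EMG can be processed event by event, which is the source of the energy advantages the paper benchmarks.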
Scalable Computer System Design for Heterogeneous Natural Language Processing Models
Ph.D. dissertation -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, Feb. 2021. Advisor: Jangwoo Kim.
Modern neural-network (NN) accelerators have succeeded by accelerating a small number of basic operations (e.g., convolution, fully connected, feedback) that make up specific target neural-network models (e.g., CNNs, RNNs). However, this approach no longer works for the emerging full-scale natural language processing (NLP) models (e.g., memory networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only their own basic operations and/or customize them for specific models, which leads to low performance improvement and narrow model coverage. An ideal NLP accelerator should therefore first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and should adaptively optimize its architecture to achieve the best performance for the given model.
To address these scalability and model/config diversity issues, the dissertation introduces two novel projects (MnnFast and NLP-Fast) to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by model/config diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform.
(Korean abstract, translated:) As natural language processing grows in importance, companies and research groups are proposing diverse and complex NLP models: the models are becoming more complex in structure, larger in scale, and more varied in kind. This dissertation proposes several key ideas to address the complexity, scalability, and diversity of such models: (1) static/dynamic analysis to identify the distribution of performance overheads across diverse NLP models; (2) a holistic model-parallelization technique that optimizes memory usage for the main bottlenecks identified by this analysis; (3) techniques that reduce the computation of several operations, together with a dynamic scheduler that resolves the skewness this reduction introduces; and (4) a technique that derives an optimized design for each model to handle per-model performance diversity. Because these key techniques apply generically across different hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they can be broadly applied to computer-system design for NLP models. This dissertation applies the techniques in CPU, GPU, and FPGA environments and shows that all of them achieve meaningful performance improvements in each environment.
1 INTRODUCTION
2 Background
2.1 Memory Networks
2.2 Deep Learning for NLP
3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks
3.1 Motivation & Design Goals
3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements
3.1.2 Performance Problems in MemNN - High Computation
3.1.3 Performance Problems in MemNN - Shared Cache Contention
3.1.4 Design Goals
3.2 MnnFast
3.2.1 Column-Based Algorithm
3.2.2 Zero Skipping
3.2.3 Embedding Cache
3.3 Implementation
3.3.1 General-Purpose Architecture - CPU
3.3.2 General-Purpose Architecture - GPU
3.3.3 Custom Hardware (FPGA)
3.4 Evaluation
3.4.1 Experimental Setup
3.4.2 CPU
3.4.3 GPU
3.4.4 FPGA
3.4.5 Comparison Between CPU and FPGA
3.5 Conclusion
4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models
4.1 Motivation & Design Goals
4.1.1 High Model Complexity
4.1.2 High Memory Bandwidth
4.1.3 Heavy Computation
4.1.4 Huge Performance Variation
4.1.5 Design Goals
4.2 NLP-Fast
4.2.1 Bottleneck Analysis of NLP Models
4.2.2 Holistic Model Partitioning
4.2.3 Cross-operation Zero Skipping
4.2.4 Adaptive Hardware Reconfiguration
4.3 NLP-Fast Toolkit
4.4 Implementation
4.4.1 General-Purpose Architecture - CPU
4.4.2 General-Purpose Architecture - GPU
4.4.3 Custom Hardware (FPGA)
4.5 Evaluation
4.5.1 Experimental Setup
4.5.2 CPU
4.5.3 GPU
4.5.4 FPGA
4.6 Conclusion
5 Related Work
5.1 Various DNN Accelerators
5.2 Various NLP Accelerators
5.3 Model Partitioning
5.4 Approximation
5.5 Improving Flexibility
5.6 Resource Optimization
6 Conclusion
Abstract (In Korean)
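The "Zero Skipping" entries in the contents above point at skipping computation on near-zero values. A generic sketch of that idea for a softmax-weighted memory read is shown below; the threshold, names, and structure are illustrative assumptions, not the dissertation's actual algorithm:

```python
import numpy as np

def attention_output(scores, values, eps=1e-3):
    """Softmax-weighted sum over memory `values`, skipping entries whose
    attention weight falls below `eps` (illustrative threshold). Skipped
    rows are never fetched, trading a small, bounded accuracy loss for
    less memory traffic and computation."""
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    keep = w >= eps                      # rows that are actually read
    out = w[keep] @ values[keep]
    return out, int(keep.sum())

# With one dominant score, almost all memory rows can be skipped.
scores = np.array([10.0, 0.0, 0.1, -1.0])
values = np.eye(4)
out, n_read = attention_output(scores, values)
```

Since attention weights in large NLP models are typically concentrated on a few entries, such skipping can remove most of the value fetches while changing the output by at most `eps` per skipped weight.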