Algorithm Architecture Co-design for Dense and Sparse Matrix Computations
abstract: With the end of Dennard scaling and Moore's law, architects have moved
toward heterogeneous designs consisting of specialized cores that achieve higher
performance and energy efficiency for a target application domain. Linear algebra is
ubiquitous in scientific computing, machine learning, and statistics, with matrix
computations being fundamental to these solutions. Designing multiple dense (or
sparse) matrix computation routines on the same platform is quite challenging, and
the difficulty is compounded by the fact that dense and sparse matrix computations
differ greatly in their storage and access patterns, making them hard to optimize on
the same architecture. This thesis addresses this challenge and introduces a
reconfigurable accelerator that supports both dense and sparse matrix computations
efficiently.
The reconfigurable architecture has been optimized to execute the following linear
algebra routines: GEMV (Dense General Matrix Vector Multiplication), GEMM
(Dense General Matrix Matrix Multiplication), TRSM (Triangular Matrix Solver),
LU Decomposition, Matrix Inverse, SpMV (Sparse Matrix Vector Multiplication),
and SpMM (Sparse Matrix Matrix Multiplication). It is a multicore architecture in
which each core consists of a 2D array of processing elements (PEs).
The 2D PE array is of size 4x4 and is scheduled to perform 4x4 matrix updates
efficiently; a sequence of such updates is used to solve a larger problem inside
a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR)
is used to perform the sparse kernel updates. Scalable partitioning and mapping
schemes are presented that map input matrices of any given size onto the multicore
architecture. Design trade-offs related to the PE array dimension, the size of the
local memory inside a core, and the bandwidth between the on-chip memories and the
cores are presented, and an optimal core configuration is derived from this
analysis. Synthesis results using a 7nm PDK show that the proposed accelerator can
achieve a performance of up to 32 GOPS using a single core.
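The block-update scheme above can be sketched in plain Python: a large GEMM is decomposed into a sequence of 4x4 block updates, each corresponding to the unit of work one 4x4 PE array would execute per step. This is a minimal illustration, not the thesis's implementation; the function name and the assumption that all dimensions are multiples of 4 are choices made here for clarity.

```python
B = 4  # PE array dimension (4x4 in the thesis)

def blocked_gemm(A, Bmat, C):
    """C += A @ Bmat via B x B block updates (dims assumed multiples of B)."""
    n, k, m = len(A), len(Bmat), len(Bmat[0])
    for i0 in range(0, n, B):
        for j0 in range(0, m, B):
            for p0 in range(0, k, B):
                # One 4x4 block update: the unit of work mapped to the PE array.
                for i in range(i0, i0 + B):
                    for j in range(j0, j0 + B):
                        acc = 0
                        for p in range(p0, p0 + B):
                            acc += A[i][p] * Bmat[p][j]
                        C[i][j] += acc
    return C
```

A hardware scheduler would stream these block updates through the PE array; here the triple block loop simply enumerates them in row-major order.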
Scalable Computer System Design for Heterogeneous Natural Language Processing Models
Ph.D. dissertation -- Seoul National University, Department of Electrical and Computer Engineering, February 2021. Advisor: Jangwoo Kim.
Modern neural-network (NN) accelerators have succeeded by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP) models (e.g., memory networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only their own basic operations and/or customize them for specific models, which leads to low performance improvements and narrow model coverage. An ideal NLP accelerator should therefore identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and should adaptively optimize its architecture to achieve the best performance for a given model.
To address these scalability and model/configuration diversity issues, this dissertation introduces two projects, MnnFast and NLP-Fast, to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by model/configuration diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform.

Abstract (translated from Korean): As the importance of natural language processing grows, many companies and research groups are proposing diverse and complex NLP models; these models are becoming more complex in structure, larger in scale, and more varied in kind. This dissertation proposes several key ideas to address the complexity, scalability, and diversity of such NLP models: (1) static/dynamic analysis to characterize the distribution of performance overheads across diverse NLP models; (2) a holistic model-parallelization technique that optimizes the memory usage of the memory-dominated bottlenecks identified by this analysis; (3) techniques that reduce the computation of several operations, together with a dynamic scheduler that resolves the skewness caused by this reduction; and (4) a technique that generates an optimized design for each model to handle the performance diversity across NLP models. Because these key techniques can be applied generically to many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they are broadly applicable to computer system design for NLP models. The dissertation applies these techniques in CPU, GPU, and FPGA environments and shows that they achieve meaningful performance improvements in every case.

1 INTRODUCTION
2 Background
2.1 Memory Networks
2.2 Deep Learning for NLP
3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks
3.1 Motivation & Design Goals
3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements
3.1.2 Performance Problems in MemNN - High Computation
3.1.3 Performance Problems in MemNN - Shared Cache Contention
3.1.4 Design Goals
3.2 MnnFast
3.2.1 Column-Based Algorithm
3.2.2 Zero Skipping
3.2.3 Embedding Cache
3.3 Implementation
3.3.1 General-Purpose Architecture - CPU
3.3.2 General-Purpose Architecture - GPU
3.3.3 Custom Hardware (FPGA)
3.4 Evaluation
3.4.1 Experimental Setup
3.4.2 CPU
3.4.3 GPU
3.4.4 FPGA
3.4.5 Comparison Between CPU and FPGA
3.5 Conclusion
4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models
4.1 Motivation & Design Goals
4.1.1 High Model Complexity
4.1.2 High Memory Bandwidth
4.1.3 Heavy Computation
4.1.4 Huge Performance Variation
4.1.5 Design Goals
4.2 NLP-Fast
4.2.1 Bottleneck Analysis of NLP Models
4.2.2 Holistic Model Partitioning
4.2.3 Cross-operation Zero Skipping
4.2.4 Adaptive Hardware Reconfiguration
4.3 NLP-Fast Toolkit
4.4 Implementation
4.4.1 General-Purpose Architecture - CPU
4.4.2 General-Purpose Architecture - GPU
4.4.3 Custom Hardware (FPGA)
4.5 Evaluation
4.5.1 Experimental Setup
4.5.2 CPU
4.5.3 GPU
4.5.4 FPGA
4.6 Conclusion
5 Related Work
5.1 Various DNN Accelerators
5.2 Various NLP Accelerators
5.3 Model Partitioning
5.4 Approximation
5.5 Improving Flexibility
5.6 Resource Optimization
6 Conclusion
Abstract (In Korean)
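The zero-skipping optimization listed in the contents (Section 3.2.2) can be illustrated with a minimal sketch, which is not the authors' code: after the softmax in a memory network's attention step, weights below a small threshold contribute negligibly to the weighted sum over memory embeddings, so their memory reads and multiplies can be skipped. The threshold value and data layout here are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum_zero_skip(scores, embeddings, threshold=1e-3):
    """Weighted sum over memory embeddings, skipping negligible weights."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    out = [0.0] * dim
    skipped = 0
    for w, emb in zip(weights, embeddings):
        if w < threshold:   # zero skipping: avoid the embedding read + FLOPs
            skipped += 1
            continue
        for d in range(dim):
            out[d] += w * emb[d]
    return out, skipped
```

With a sharply peaked softmax, most entries fall below the threshold, so both memory bandwidth and computation drop roughly in proportion to the number of skipped entries.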
Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices
Convolutional Neural Networks (CNNs) have revolutionized the research in
computer vision, due to their ability to capture complex patterns, resulting in
high inference accuracies. However, the increasingly complex nature of these
neural networks means that they are particularly suited for server computers
with powerful GPUs. We envision that deep learning applications will eventually
be widely deployed on mobile devices, e.g., smartphones,
self-driving cars, and drones. Therefore, in this paper, we aim to understand
the resource requirements (time, memory) of CNNs on mobile devices. First, by
deploying several popular CNNs on mobile CPUs and GPUs, we measure and analyze
the performance and resource usage for every layer of the CNNs. Our findings
point out the potential ways of optimizing the performance on mobile devices.
Second, we model the resource requirements of the different CNN computations.
Finally, based on the measurement, profiling, and modeling, we build and
evaluate our modeling tool, Augur, which takes a CNN configuration (descriptor)
as input and estimates the compute time and resource usage of the CNN, to
give insights about whether and how efficiently a CNN can be run on a given
mobile platform. In doing so, Augur tackles several challenges: (i) how to
overcome profiling and measurement overhead; (ii) how to capture the variance
across mobile platforms with different processors, memory, and cache sizes;
and (iii) how to account for the variance in the number, type, and size of
layers across different CNN configurations.
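A rough sketch of the kind of per-layer modeling Augur performs follows; the function names, descriptor fields, and roofline-style cost model are hypothetical illustrations, not Augur's actual model. Analytical FLOP and memory-traffic counts for a convolution layer, combined with platform-specific throughput parameters, yield a compute- or bandwidth-bound time estimate.

```python
def conv_flops(cin, cout, k, h_out, w_out):
    # Multiply-accumulates for a standard (dense, stride-agnostic) convolution:
    # each output element needs cin * k * k MACs, counted as 2 FLOPs apiece.
    return 2 * cin * cout * k * k * h_out * w_out

def conv_mem_bytes(cin, cout, k, h_out, w_out, bytes_per_elem=4):
    # Approximate traffic: read the weights once, write the output once.
    weights = cin * cout * k * k
    output = cout * h_out * w_out
    return (weights + output) * bytes_per_elem

def estimate_layer_time_ms(layer, flops_per_ms, bytes_per_ms):
    """Roofline-style estimate: time is bounded by compute or memory traffic."""
    f = conv_flops(layer["cin"], layer["cout"], layer["k"],
                   layer["h_out"], layer["w_out"])
    b = conv_mem_bytes(layer["cin"], layer["cout"], layer["k"],
                       layer["h_out"], layer["w_out"])
    return max(f / flops_per_ms, b / bytes_per_ms)
```

In a real tool the throughput parameters would be fitted per platform from the measured layer profiles, which is how platform variance (challenge ii) is absorbed into the model.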
- …