
    Algorithm Architecture Co-design for Dense and Sparse Matrix Computations

    abstract: With the end of Dennard scaling and Moore's law, architects have moved towards heterogeneous designs consisting of specialized cores to achieve higher performance and energy efficiency for a target application domain. Linear algebra is ubiquitous in scientific computing, machine learning, statistics, and related fields, and matrix computations are fundamental to these linear algebra based solutions. Designing multiple dense (or sparse) matrix computation routines on the same platform is quite challenging, and the difficulty is compounded by the large differences between dense and sparse matrix computations in their storage and access patterns, which makes them hard to optimize on the same architecture. This thesis addresses this challenge and introduces a reconfigurable accelerator that supports both dense and sparse matrix computations efficiently. The reconfigurable architecture has been optimized to execute the following linear algebra routines: GEMV (dense general matrix-vector multiplication), GEMM (dense general matrix-matrix multiplication), TRSM (triangular matrix solve), LU decomposition, matrix inverse, SpMV (sparse matrix-vector multiplication), and SpMM (sparse matrix-matrix multiplication). It is a multicore architecture in which each core consists of a 4x4 2D array of processing elements (PEs) scheduled to perform 4x4 matrix updates efficiently; a sequence of such updates is used to solve a larger problem inside a core. A novel partitioned block compressed sparse data structure (PBCSC/PBCSR) is used to perform the sparse kernel updates. Scalable partitioning and mapping schemes are presented that map input matrices of any given size onto the multicore architecture. Design trade-offs related to the PE array dimension, the size of local memory inside a core, and the bandwidth between the on-chip memories and the cores are presented, and an optimal core configuration is derived from this analysis. Synthesis results using a 7 nm PDK show that the proposed accelerator can achieve a performance of up to 32 GOPS using a single core. Dissertation/Thesis. Masters Thesis, Computer Engineering 201
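    The PBCSC/PBCSR layouts themselves are not detailed in this abstract, but the general idea of a block-compressed sparse format feeding fixed-size 4x4 updates can be illustrated with a short sketch. The Python below is a minimal illustration under stated assumptions (the block size, field layout, and the to_bcsr/bcsr_spmv helpers are hypothetical, not the thesis's actual data structure): a matrix is partitioned into 4x4 tiles, only non-zero tiles are kept in a CSR-like layout, and SpMV proceeds one 4x4 block update at a time, mirroring the unit of work of the PE array.

```python
import numpy as np

BLOCK = 4  # matches the 4x4 PE array size mentioned in the abstract

def to_bcsr(A, block=BLOCK):
    """Pack a dense matrix into a simple block-CSR layout (illustrative only).

    Returns (block_ptr, block_col, blocks): for each block-row, block_ptr gives
    the range of stored non-zero tiles, block_col their block-column indices,
    and blocks the dense 4x4 tiles themselves.
    """
    n_br, n_bc = A.shape[0] // block, A.shape[1] // block
    block_ptr, block_col, blocks = [0], [], []
    for bi in range(n_br):
        for bj in range(n_bc):
            tile = A[bi*block:(bi+1)*block, bj*block:(bj+1)*block]
            if np.any(tile):                 # keep only non-zero 4x4 tiles
                block_col.append(bj)
                blocks.append(tile.copy())
            # empty tiles are skipped entirely -- this is where sparsity pays off
        block_ptr.append(len(block_col))
    return block_ptr, block_col, blocks

def bcsr_spmv(block_ptr, block_col, blocks, x, block=BLOCK):
    """y = A @ x computed as a sequence of 4x4 block updates."""
    n_br = len(block_ptr) - 1
    y = np.zeros(n_br * block)
    for bi in range(n_br):
        for k in range(block_ptr[bi], block_ptr[bi+1]):
            bj = block_col[k]
            # one 4x4 matrix-vector update: the unit of work of a single PE array pass
            y[bi*block:(bi+1)*block] += blocks[k] @ x[bj*block:(bj+1)*block]
    return y

# Small self-check against the dense result
A = np.zeros((8, 8))
A[0:4, 4:8] = np.arange(16).reshape(4, 4)
A[4:8, 0:4] = np.eye(4)
x = np.arange(8, dtype=float)
assert np.allclose(bcsr_spmv(*to_bcsr(A), x), A @ x)
```

    Choosing the block size to match the PE array keeps every update a full 4x4 operation, which is the usual motivation for block-compressed sparse formats on fixed-size PE or systolic arrays.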

    이쒅 μžμ—°μ–΄ 처리 λͺ¨λΈμ„ μœ„ν•œ ν™•μž₯ν˜• 컴퓨터 μ‹œμŠ€ν…œ 섀계

    Ph.D. dissertation -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: κΉ€μž₯우 (Jangwoo Kim). Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP) neural network models (e.g., memory networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only their own proposal-specific basic operations and/or customize them for specific models only, which leads to low performance improvement and narrow model coverage. Therefore, an ideal NLP accelerator should first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and it should be able to adaptively optimize its architecture to achieve the best performance for the given model. To address these scalability and model/configuration diversity issues, this dissertation introduces two projects, MnnFast and NLP-Fast, that efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems in memory-augmented neural networks: high memory bandwidth, heavy computation, and cache contention. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by model/configuration diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform. As natural language processing has grown in importance, companies and research groups have proposed many diverse and complex NLP models: the models are becoming structurally more complex, larger in scale, and more varied in kind. This dissertation presents several key ideas to address this complexity, scalability, and diversity: (1) static/dynamic analysis that identifies how the performance overhead is distributed across various NLP models; (2) a holistic model-partitioning technique that optimizes the memory usage of the major performance bottlenecks identified by that analysis; (3) techniques that reduce the amount of computation in several operations, together with a dynamic scheduler that resolves the skewness problem introduced by the computation reduction; and (4) a technique that derives a design optimized for each model to address the performance diversity of current NLP models. Because these key techniques apply generically to many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they can be broadly applied to computer system design for NLP models.
λ³Έ λ…Όλ¬Έμ—μ„œλŠ” ν•΄λ‹Ή κΈ°μˆ λ“€μ„ μ μš©ν•˜μ—¬ CPU, GPU, FPGA 각각의 ν™˜κ²½μ—μ„œ, μ œμ‹œλœ κΈ°μˆ λ“€μ΄ λͺ¨λ‘ μœ μ˜λ―Έν•œ μ„±λŠ₯ν–₯상을 달성함을 보여쀀닀.1 INTRODUCTION 1 2 Background 6 2.1 Memory Networks 6 2.2 Deep Learning for NLP 9 3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks 14 3.1 Motivation & Design Goals 14 3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements 15 3.1.2 Performance Problems in MemNN - High Computation 16 3.1.3 Performance Problems in MemNN - Shared Cache Contention 17 3.1.4 Design Goals 18 3.2 MnnFast 19 3.2.1 Column-Based Algorithm 19 3.2.2 Zero Skipping 22 3.2.3 Embedding Cache 25 3.3 Implementation 26 3.3.1 General-Purpose Architecture - CPU 26 3.3.2 General-Purpose Architecture - GPU 28 3.3.3 Custom Hardware (FPGA) 29 3.4 Evaluation 31 3.4.1 Experimental Setup 31 3.4.2 CPU 33 3.4.3 GPU 35 3.4.4 FPGA 37 3.4.5 Comparison Between CPU and FPGA 39 3.5 Conclusion 39 4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models 40 4.1 Motivation & Design Goals 40 4.1.1 High Model Complexity 40 4.1.2 High Memory Bandwidth 41 4.1.3 Heavy Computation 42 4.1.4 Huge Performance Variation 43 4.1.5 Design Goals 43 4.2 NLP-Fast 44 4.2.1 Bottleneck Analysis of NLP Models 44 4.2.2 Holistic Model Partitioning 47 4.2.3 Cross-operation Zero Skipping 51 4.2.4 Adaptive Hardware Reconfiguration 54 4.3 NLP-Fast Toolkit 56 4.4 Implementation 59 4.4.1 General-Purpose Architecture - CPU 59 4.4.2 General-Purpose Architecture - GPU 61 4.4.3 Custom Hardware (FPGA) 62 4.5 Evaluation 64 4.5.1 Experimental Setup 65 4.5.2 CPU 65 4.5.3 GPU 67 4.5.4 FPGA 69 4.6 Conclusion 72 5 Related Work 73 5.1 Various DNN Accelerators 73 5.2 Various NLP Accelerators 74 5.3 Model Partitioning 75 5.4 Approximation 76 5.5 Improving Flexibility 78 5.6 Resource Optimization 78 6 Conclusion 80 Abstract (In Korean) 106Docto

    Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices

    Convolutional Neural Networks (CNNs) have revolutionized research in computer vision due to their ability to capture complex patterns, resulting in high inference accuracies. However, the increasingly complex nature of these neural networks means that they are particularly suited for server computers with powerful GPUs. We envision that deep learning applications will eventually be widely deployed on mobile devices, e.g., smartphones, self-driving cars, and drones. Therefore, in this paper, we aim to understand the resource requirements (time, memory) of CNNs on mobile devices. First, by deploying several popular CNNs on mobile CPUs and GPUs, we measure and analyze the performance and resource usage of every layer of the CNNs. Our findings point out potential ways of optimizing performance on mobile devices. Second, we model the resource requirements of the different CNN computations. Finally, based on the measurement, profiling, and modeling, we build and evaluate our modeling tool, Augur, which takes a CNN configuration (descriptor) as input and estimates the compute time and resource usage of the CNN, to give insights about whether and how efficiently a CNN can be run on a given mobile platform. In doing so, Augur tackles several challenges: (i) how to overcome profiling and measurement overhead; (ii) how to capture the variance across mobile platforms with different processors, memory, and cache sizes; and (iii) how to account for the variance in the number, type, and size of layers across different CNN configurations.
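    Augur's internals are not spelled out in this abstract, but the per-layer quantities such a model would start from can be sketched. The Python below is an illustrative sketch under assumptions (the ConvLayer descriptor, the conv_layer_cost helper, and the example layers are hypothetical, not Augur's actual descriptor format): it derives the multiply-accumulate count and the weight/activation footprint of a convolutional layer from its configuration, the kind of per-layer features a resource model could fit measured run times and memory usage against.

```python
from dataclasses import dataclass

@dataclass
class ConvLayer:
    """Minimal conv-layer descriptor: H x W x C_in input, K x K kernels, C_out filters."""
    h: int
    w: int
    c_in: int
    k: int
    c_out: int
    stride: int = 1
    pad: int = 0

def conv_layer_cost(layer: ConvLayer, bytes_per_elem: int = 4):
    """Estimate MACs and memory footprint (bytes) of one convolutional layer."""
    out_h = (layer.h + 2 * layer.pad - layer.k) // layer.stride + 1
    out_w = (layer.w + 2 * layer.pad - layer.k) // layer.stride + 1
    macs = out_h * out_w * layer.c_out * layer.k * layer.k * layer.c_in
    weight_bytes = layer.c_out * layer.c_in * layer.k * layer.k * bytes_per_elem
    activation_bytes = (layer.h * layer.w * layer.c_in
                        + out_h * out_w * layer.c_out) * bytes_per_elem
    return {"macs": macs, "weight_bytes": weight_bytes,
            "activation_bytes": activation_bytes,
            "out_shape": (out_h, out_w, layer.c_out)}

# Example: the first two conv layers of a VGG-like network on a 224x224 RGB input.
layers = [ConvLayer(224, 224, 3, 3, 64, pad=1),
          ConvLayer(224, 224, 64, 3, 64, pad=1)]
for i, layer in enumerate(layers):
    cost = conv_layer_cost(layer)
    total_mib = (cost["weight_bytes"] + cost["activation_bytes"]) / 2**20
    print(f"conv{i}: {cost['macs'] / 1e6:.1f} MMACs, {total_mib:.1f} MiB")
```

    A per-platform model would then map these counts to time, e.g. by fitting measured layer latencies against MACs and bytes moved for each processor and memory hierarchy.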