99 research outputs found

    Reconfigurable acceleration of Recurrent Neural Networks

    Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences, such as natural language processing, speech recognition, and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies of RNN computations can lead to system stalls, resulting in low throughput and high latency. This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, taking into account the data dependencies and high computational cost of RNNs. The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which supports large weight-matrix dimensions while enabling both element-based and vector-based parallelism. The presented reconfigurable RNN designs show significant speedups over CPU, GPU, and other FPGA designs. The second contribution of this thesis is a weight-reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. A performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption, and performance. Promising power-efficiency improvements are achieved in addition to speedups over CPU and GPU designs. The third contribution of this thesis is a low-latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances the initiation intervals of multi-layer RNN inference to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection. To facilitate the use of this approach, we open-source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.
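    The column-wise scheme is the core of the first contribution: instead of producing each output element as a full dot product over the input vector (row-wise), the engine scales one column of the weight matrix by one input element and accumulates it into all outputs, so useful work can start before the previous hidden state is fully available. A minimal NumPy sketch of the two orderings (illustrative only; the function names and sizes are assumptions, not the thesis design):

```python
import numpy as np

def mv_row_wise(W, x):
    # Conventional row-wise MVM: output y[i] is a full dot product, so it
    # cannot start until the entire input vector x is available.
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        y[i] = np.dot(W[i, :], x)
    return y

def mv_column_wise(W, x):
    # Column-wise MVM: column j of W is scaled by x[j] and accumulated into
    # every output. Accumulation can begin as soon as x[0] is known, which is
    # what lets a hardware engine hide the wait for the previous hidden state.
    y = np.zeros(W.shape[0])
    for j in range(W.shape[1]):
        y += W[:, j] * x[j]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
assert np.allclose(mv_row_wise(W, x), mv_column_wise(W, x))
```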

    Multi-LSTM Acceleration and CNN Fault Tolerance

    This thesis addresses two problems in the field of machine learning: the acceleration of multiple Long Short-Term Memory (LSTM) models on FPGAs, and the fault tolerance of compressed Convolutional Neural Networks (CNNs). LSTMs are an effective solution for capturing long-term dependencies in sequential data, such as sentences in natural language processing, video frames in scene labeling, or temporal series in time-series forecasting. To further boost their efficacy, especially in the presence of long sequences, multiple LSTM models are utilized in hierarchical and stacked configurations. However, because of their memory-bound nature, efficiently mapping multiple LSTMs onto a computing device becomes even more challenging. The first part of this thesis addresses the problem of mapping multiple LSTM models to an FPGA device by introducing a framework that adapts their memory requirements to the target architecture. For a similar accuracy loss, the proposed framework maps multiple LSTMs with a performance improvement of 3x to 5x over state-of-the-art approaches. In the second part of this thesis, we investigate the fault tolerance of CNNs, another effective deep-learning architecture. CNNs are a dominant solution for image classification tasks, but suffer from a high performance cost due to their computational structure. In fact, because of their large parameter space, fetching their data from main memory typically becomes a performance bottleneck. To tackle this problem, various parameter-compression techniques have been developed, such as weight pruning, weight clustering, and weight quantization. However, reducing the memory footprint of an application can make its data more sensitive to faults. For this thesis work, we conducted an analysis to verify the conditions for applying OddECC, a mechanism that supports ECCs of variable strength and size for different memory regions. Our experiments reveal that compressed CNNs, whose memory footprint is reduced by up to 86.3x using the aforementioned compression schemes, exhibit accuracy drops of up to 13.56% in the presence of random single-bit faults.
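    The fault-tolerance concern follows from how compression concentrates information: after weight clustering, all weights that map to the same codebook entry share one stored value, so a single bit flip in that entry corrupts every one of them at once. A minimal sketch of this effect (toy uniform clustering and a simulated float32 bit flip; the names and parameters are illustrative assumptions, not the thesis setup):

```python
import numpy as np

def cluster_weights(w, n_clusters=16):
    # Toy weight clustering: snap each weight to the nearest of
    # n_clusters centroids spread uniformly over the weight range.
    centroids = np.linspace(w.min(), w.max(), n_clusters).astype(np.float32)
    idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, idx

def flip_bit(value, bit):
    # Emulate a random memory fault by flipping one bit of a float32 value.
    raw = np.float32(value).view(np.uint32)
    return np.uint32(raw ^ (np.uint32(1) << np.uint32(bit))).view(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
centroids, idx = cluster_weights(w)

faulty = centroids.copy()
faulty[3] = flip_bit(faulty[3], 25)   # one exponent-bit flip in one codebook entry

corrupted = np.count_nonzero(centroids[idx] != faulty[idx])
print(f"one flipped bit corrupted {corrupted} of {w.size} clustered weights")
```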

    RNN-Based Radio Resource Management on Multicore RISC-V Accelerator Architectures

    Radio resource management (RRM) is critical in 5G mobile communications because of its ubiquity on every radio device and its low-latency constraints. Rapidly evolving RRM algorithms with low-latency requirements, combined with dense and massive 5G base-station deployments, call for an on-the-edge RRM acceleration system that balances flexibility, efficiency, and cost, making application-specific instruction-set processors (ASIPs) an optimal choice. In this work, we start from a baseline, simple RISC-V core and introduce instruction extensions coupled with software optimizations to maximize the throughput of a selected set of recently proposed RRM algorithms based on multilayer perceptrons (MLPs) and recurrent neural networks (RNNs). Furthermore, we scale from a single-ASIP to a multi-ASIP acceleration system to further improve RRM throughput. For the single-ASIP system, we demonstrate an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s, corresponding to improvements of 10x and 10.6x, respectively, over a single-core system with a baseline RV32IMC core. For the multi-ASIP system, we analyze how the parallel speedup depends on the input and output feature-map (FM) sizes for fully connected and LSTM layers, achieving up to a 10.2x speedup with 16 cores over a single extended RI5CY core for single LSTM layers and a 13.8x speedup for a single fully connected layer. On the full RRM benchmark suite, we achieve average overall speedups of 16.4x, 25.2x, 31.9x, and 38.8x on two, four, eight, and 16 cores, respectively, compared to our single-core RV32IMC baseline implementation.
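    The headline numbers can be related with simple arithmetic: GMAC/s/W is equivalent to MACs per joule, so a throughput of 566 MMAC/s at 218 GMAC/s/W implies roughly 2.6 mW of compute power, and per-layer MAC counts are what tie the FM sizes to the achievable speedup. A back-of-the-envelope sketch (the layer sizes are hypothetical, not taken from the paper):

```python
def fc_macs(n_in, n_out):
    # Fully connected layer: one multiply-accumulate per weight.
    return n_in * n_out

def lstm_macs(n_in, n_hidden):
    # One LSTM time step: four gates, each a matrix-vector product
    # over the concatenated [input, previous hidden state].
    return 4 * (n_in + n_hidden) * n_hidden

# Single-ASIP figures reported in the abstract.
throughput_mac_s = 566e6      # 566 MMAC/s
efficiency_mac_j = 218e9      # 218 GMAC/s/W, i.e. MACs per joule

power_w = throughput_mac_s / efficiency_mac_j
print(f"implied compute power: {power_w * 1e3:.2f} mW")

# Hypothetical small RRM-style layers (sizes are assumptions).
for name, macs in [("FC 64x32", fc_macs(64, 32)),
                   ("LSTM 32->64", lstm_macs(32, 64))]:
    latency_us = macs / throughput_mac_s * 1e6
    print(f"{name}: {macs} MACs -> {latency_us:.1f} us per step")
```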

    이쒅 μžμ—°μ–΄ 처리 λͺ¨λΈμ„ μœ„ν•œ ν™•μž₯ν˜• 컴퓨터 μ‹œμŠ€ν…œ 섀계

    Doctoral dissertation -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Jangwoo Kim.
    Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) that make up specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP) neural-network models (e.g., Memory Networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only proposal-specific basic operations and/or customize them for specific models, which leads to low performance improvement and narrow model coverage. Therefore, an ideal NLP accelerator should first identify all performance-critical operations required by different NLP models and support them in a single accelerator to achieve high model coverage, and it should adaptively optimize its architecture to achieve the best performance for the given model. To address these scalability and model/configuration diversity issues, this dissertation introduces two projects, MnnFast and NLP-Fast, to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems in memory-augmented neural networks: high memory bandwidth, heavy computation, and cache contention. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation caused by the model/configuration diversity of emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (CPU, GPU, FPGA) and thoroughly evaluate their performance improvements on each platform.
    As the importance of natural language processing has grown, companies and research groups have proposed diverse and complex NLP models: NLP models are becoming more complex in structure, larger in scale, and more varied in kind. This dissertation presents several key ideas to address this complexity, scalability, and diversity: (1) static/dynamic analysis to identify how performance overheads are distributed across different kinds of NLP models; (2) a holistic model-parallelization technique that optimizes the memory usage of the main performance bottlenecks identified by this analysis; (3) techniques that reduce the amount of computation in several operations, together with a dynamic scheduler that resolves the skewness introduced by this reduction; and (4) a technique that derives a design optimized for each model to cope with the performance diversity of current NLP models. Because these key techniques apply generically to many kinds of hardware accelerators (e.g., CPU, GPU, FPGA, ASIC), they can be applied broadly to computer-system design for NLP models. This dissertation applies them on CPU, GPU, and FPGA platforms and shows that they all achieve meaningful performance improvements in each environment.
    Contents:
    1 Introduction
    2 Background
      2.1 Memory Networks
      2.2 Deep Learning for NLP
    3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks
      3.1 Motivation & Design Goals
        3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements
        3.1.2 Performance Problems in MemNN - High Computation
        3.1.3 Performance Problems in MemNN - Shared Cache Contention
        3.1.4 Design Goals
      3.2 MnnFast
        3.2.1 Column-Based Algorithm
        3.2.2 Zero Skipping
        3.2.3 Embedding Cache
      3.3 Implementation
        3.3.1 General-Purpose Architecture - CPU
        3.3.2 General-Purpose Architecture - GPU
        3.3.3 Custom Hardware (FPGA)
      3.4 Evaluation
        3.4.1 Experimental Setup
        3.4.2 CPU
        3.4.3 GPU
        3.4.4 FPGA
        3.4.5 Comparison Between CPU and FPGA
      3.5 Conclusion
    4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models
      4.1 Motivation & Design Goals
        4.1.1 High Model Complexity
        4.1.2 High Memory Bandwidth
        4.1.3 Heavy Computation
        4.1.4 Huge Performance Variation
        4.1.5 Design Goals
      4.2 NLP-Fast
        4.2.1 Bottleneck Analysis of NLP Models
        4.2.2 Holistic Model Partitioning
        4.2.3 Cross-operation Zero Skipping
        4.2.4 Adaptive Hardware Reconfiguration
      4.3 NLP-Fast Toolkit
      4.4 Implementation
        4.4.1 General-Purpose Architecture - CPU
        4.4.2 General-Purpose Architecture - GPU
        4.4.3 Custom Hardware (FPGA)
      4.5 Evaluation
        4.5.1 Experimental Setup
        4.5.2 CPU
        4.5.3 GPU
        4.5.4 FPGA
      4.6 Conclusion
    5 Related Work
      5.1 Various DNN Accelerators
      5.2 Various NLP Accelerators
      5.3 Model Partitioning
      5.4 Approximation
      5.5 Improving Flexibility
      5.6 Resource Optimization
    6 Conclusion
    Abstract (In Korean)