134 research outputs found

    Algorithms for Accelerating Machine Learning with Wide and Deep Models

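    Kyoto University, doctoral degree by coursework (new system); Doctor of Informatics; degree no. Kō 23310; Jōhaku no. 746; call number 新制||情||127 (Kyoto University Library). Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. Examiners: Professor Hisashi Kashima (chief examiner), Professor Toshiyuki Tanaka, Professor Nobuo Yamashita. Conferred under Article 4, Paragraph 1 of the Degree Regulations. Doctor of Informatics, Kyoto University. DFA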

    Secure Shapley Value for Cross-Silo Federated Learning

    The Shapley value (SV) is a fair and principled metric for contribution evaluation in cross-silo federated learning (cross-silo FL), wherein organizations, i.e., clients, collaboratively train prediction models with the coordination of a parameter server. However, existing SV calculation methods for FL assume that the server can access the raw FL models and public test data. This may not be a valid assumption in practice, considering the emerging privacy attacks on FL models and the fact that test data might be clients' private assets. Hence, we investigate the problem of secure SV calculation for cross-silo FL. We first propose HESV, a one-server solution based solely on homomorphic encryption (HE) for privacy protection, which has limitations in efficiency. To overcome these limitations, we propose SecSV, an efficient two-server protocol with the following novel features. First, SecSV utilizes a hybrid privacy protection scheme to avoid ciphertext-ciphertext multiplications between test data and models, which are extremely expensive under HE. Second, an efficient secure matrix multiplication method is proposed for SecSV. Third, SecSV strategically identifies and skips some test samples without significantly affecting the evaluation accuracy. Our experiments demonstrate that SecSV is 7.2-36.6 times as fast as HESV, with a limited loss in the accuracy of calculated SVs. Comment: Extended report for our VLDB 2023 paper
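For readers unfamiliar with the Shapley value mentioned in this abstract, the following is a minimal plaintext sketch of the standard definition: each client's value is its marginal contribution averaged over all coalitions of the other clients. It involves no encryption and is not the HESV or SecSV protocol. The three clients and the `TOY_ACCURACY` table are hypothetical values chosen only for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical test accuracy of the model trained by each coalition of clients.
TOY_ACCURACY = {
    frozenset(): 0.50,
    frozenset("A"): 0.70,
    frozenset("B"): 0.70,
    frozenset("C"): 0.60,
    frozenset("AB"): 0.80,
    frozenset("AC"): 0.75,
    frozenset("BC"): 0.75,
    frozenset("ABC"): 0.85,
}


def toy_utility(coalition):
    """Look up the (made-up) accuracy for a coalition of client names."""
    return TOY_ACCURACY[frozenset(coalition)]


def shapley_values(clients, utility):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(clients)
    values = {}
    for client in clients:
        others = [c for c in clients if c != client]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                gain = utility(coalition | {client}) - utility(coalition)
                total += weight * gain
        values[client] = total
    return values


if __name__ == "__main__":
    for name, value in shapley_values(["A", "B", "C"], toy_utility).items():
        print(name, round(value, 4))
```

The values always sum to the utility of the full coalition minus that of the empty one, which is the sense in which the metric is called fair. The enumeration is exponential in the number of clients, so it is only practical for the small client counts typical of cross-silo settings.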

    End-to-End Neural Network-based Speech Recognition for Mobile and Embedded Devices

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Wonyong Sung. Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest in recent years. Deep neural network-based automatic speech recognition demands a large number of computations, while the memory bandwidth and power budget of mobile devices are limited. A server-based implementation is often employed, but this raises latency and privacy concerns. Therefore, the need for on-device ASR systems is increasing. Recurrent neural networks (RNNs) are often used for the ASR model. An RNN implementation on embedded devices can suffer from excessive DRAM accesses, because the parameter size of a neural network usually exceeds the size of the cache memory. Also, the parameters of an RNN cannot be reused across multiple time steps because of its feedback structure. To solve this problem, multi-time-step parallelizable models are applied to speech recognition. The multi-time-step parallelization approach computes multiple output samples at a time with the parameters fetched from the DRAM. Since the number of DRAM accesses can be reduced in proportion to the number of parallelization steps, a high processing speed can be achieved for the parallelizable model. In this thesis, a connectionist temporal classification (CTC) model is constructed by combining simple recurrent units (SRUs) and depth-wise 1-dimensional convolution layers for multi-time-step parallelization. Both character and word piece models are developed for the CTC model, and the corresponding RNN-based language models are used for beam search decoding. A competitive WER on the WSJ corpus is achieved with a total model size of approximately 15MB. The system operates in real time using only a single ARM core without a GPU or special hardware. A low-latency on-device speech recognition system with a simple gated convolutional network (SGCN) is also proposed. 
The SGCN shows competitive recognition accuracy even with 1M parameters. 8-bit quantization is applied to reduce the memory size and computation time. The proposed system features online recognition with a 0.4s latency limit and operates at a real-time factor (RTF) of 0.2 with only a single 900MHz CPU core. In addition, an attention-based model with a depthwise convolutional encoder is proposed. Convolutional encoders enable faster training and inference of attention models than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field length often suffers from looping or skipping problems. We believe that this is due to the time-invariance of convolutions. We attempt to remedy this issue by adding positional information to the convolution-based encoder. It is shown that the word error rate (WER) of a convolutional encoder with a short receptive field size can be reduced significantly by augmenting it with positional information. Visualization results are presented to demonstrate the effectiveness of incorporating positional information. A streaming end-to-end ASR model is also developed by applying monotonic chunkwise attention.
    1 Introduction
    1.1 End-to-End Automatic Speech Recognition with Neural Networks
    1.2 Challenges on On-device Implementation of Neural Network-based ASR
    1.3 Parallelizable Neural Network Architecture
    1.4 Scope of Dissertation
    2 Simple Recurrent Units for CTC-based End-to-End Speech Recognition
    2.1 Introduction
    2.2 Related Works
    2.3 Speech Recognition Algorithm
    2.3.1 Acoustic modeling
    2.3.2 Character-based model
    2.3.3 Word piece-based model
    2.3.4 Decoding
    2.4 Experimental Results
    2.4.1 Acoustic models
    2.4.2 Word piece based speech recognition
    2.4.3 Execution time analysis
    2.5 Concluding Remarks
    3 Low-Latency Lightweight Streaming Speech Recognition with 8-bit Quantized Depthwise Gated Convolutional Neural Networks
    3.1 Introduction
    3.2 Simple Gated Convolutional Networks
    3.2.1 Model structure
    3.2.2 Multi-time-step parallelization
    3.3 Training CTC AM with SGCN
    3.3.1 Regularization with symmetrical weight noise injection
    3.3.2 8-bit quantization
    3.4 Experimental Results
    3.4.1 Experimental setting
    3.4.2 Results on WSJ eval92
    3.4.3 Implementation on the embedded system
    3.5 Concluding Remarks
    4 Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition
    4.1 Introduction
    4.2 Related Works
    4.3 Model Description
    4.4 Experimental Results
    4.4.1 Effect of receptive field size
    4.4.2 Visualization
    4.4.3 Comparison with other models
    4.5 Concluding Remarks
    5 Convolution-based Attention Model with Positional Encoding for Streaming Speech Recognition
    5.1 Introduction
    5.2 Related Works
    5.3 End-to-End Model for Speech Recognition
    5.3.1 Model description
    5.3.2 Monotonic chunkwise attention
    5.3.3 Positional encoding
    5.4 Experimental Results
    5.4.1 Effect of positional encoding
    5.4.2 Comparison with other models
    5.4.3 Execution time analysis
    5.5 Concluding Remarks
    6 Conclusion
    Abstract (In Korean)
    Docto
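The multi-time-step parallelization idea in the abstract above can be illustrated with a small counting sketch. It assumes a purely feed-forward layer whose output at each time step does not depend on earlier outputs, which is the property that makes this reuse possible. The function name `layer_outputs` and the `row_loads` counter, which stands in for DRAM fetches, are hypothetical and not taken from the thesis.

```python
def layer_outputs(weights, frames, block):
    """Apply a linear layer to every frame, fetching each weight row once per block.

    weights: list of rows (one per output unit)
    frames:  list of input vectors, one per time step
    block:   number of time steps processed per weight fetch
    Returns (outputs, row_loads), where row_loads counts simulated DRAM fetches.
    """
    if block < 1:
        raise ValueError("block must be at least 1")
    outputs = [[0.0] * len(weights) for _ in frames]
    row_loads = 0
    for start in range(0, len(frames), block):
        steps = range(start, min(start + block, len(frames)))
        for i, row in enumerate(weights):
            row_loads += 1  # the row is fetched once and reused for every step in the block
            for t in steps:
                outputs[t][i] = sum(w * x for w, x in zip(row, frames[t]))
    return outputs, row_loads


if __name__ == "__main__":
    weights = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
    frames = [[float(t), 1.0] for t in range(8)]
    _, loads_sequential = layer_outputs(weights, frames, block=1)
    _, loads_parallel = layer_outputs(weights, frames, block=4)
    print(loads_sequential, loads_parallel)
```

With a block of four time steps, the simulated number of weight fetches drops by a factor of four while the outputs are unchanged, matching the proportional reduction the abstract describes.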

    Compressing Neural Machine Translation Models with 4-bit Precision


    Effective attention-based sequence-to-sequence modelling for automatic speech recognition

    With sufficient training data, attentional encoder-decoder models have given outstanding ASR results. In such models, the encoder encodes the input sequence into a sequence of hidden representations. The attention mechanism generates a soft alignment between the encoder hidden states and the decoder hidden states. The decoder produces the current output by considering the alignment and the previous outputs. However, attentional encoder-decoder models were originally designed for machine translation tasks, where the input and output sequences are relatively short and the alignments between them are flexible. For ASR tasks, the input sequences are notably long. Further, acoustic frames (or their hidden representations) typically can be aligned with output units in a left-to-right order, and compared to the length of the entire utterance, the duration of each output unit is usually small. Conventional encoder-decoder models have difficulties in modelling long sequences, and the attention mechanism does not guarantee monotonic left-to-right alignments. In this thesis, we study attention-based sequence-to-sequence ASR models and address the aforementioned issues. We investigate recurrent neural network (RNN) encoder-decoder models and self-attention encoder-decoder models. For RNN encoder-decoder models, we develop a dynamic subsampling RNN (dsRNN) encoder to shorten the lengths of the input sequences. The dsRNN learns to skip redundant frames. Furthermore, the skip ratio may vary at different stages of training, thus allowing the encoder to learn the most relevant information for each epoch. Thus, the dsRNN alleviates the difficulties of encoding long sequences. We also propose a fully trainable windowed attention mechanism, in which both the window shift and window length are learned by the model. Our windowed method forces the attention mechanism to attend to inputs within small sliding windows in a strict left-to-right order. 
The proposed dsRNN and windowed attention give significant performance gains over traditional encoder-decoder ASR models. We next study self-attention encoder-decoder models. For RNN encoder-decoder models, we have shown that restricting the attention within small windows is beneficial. However, self-attention encodes input sequences by comparing each element of the sequence with all other elements of the sequence. Therefore, we investigate whether the global view of self-attention is necessary for ASR. We note that the range of the learned context increases from the lower to the upper self-attention layers, and suggest that the upper encoder layers may have seen sufficient contextual information without the need for self-attention. This would imply that the upper self-attention layers can be replaced with feed-forward layers (we can view the feed-forward layers as strict local left-to-right self-attention). In practice, we observe that replacing upper encoder self-attention layers with feed-forward layers does not impact the performance. We also observe that there are individual attention heads that only attend to local information, and thus the self-attention mechanism is redundant for these attention heads. Based on these observations, we propose randomly removing attention heads during training while keeping all heads at test time. The proposed method achieves state-of-the-art ASR results on benchmark datasets of different ASR scenarios. Finally, we investigate top-down level-wise training of sequence-to-sequence ASR models. We find that when training sequence-to-sequence ASR models on noisy data, the use of upper layers trained on clean data forces the lower layers to learn noise-invariant features, since the features which fit the clean-trained upper layers are more general. We further show that within the same dataset, conventional joint training makes the upper layers quickly overfit. Therefore, we propose to freeze the upper layers and retrain the lower layers. 
The proposed method is a general training strategy; we use it not only to train ASR models but also to train other neural networks in other domains. The proposed training method yields consistent performance gains across different tasks (e.g., language modelling, image classification). In summary, we propose methods which enable attention-based sequence-to-sequence ASR systems to better model sequential data, and demonstrate the benefits of training neural networks in a top-down cascade manner.
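The windowed attention described in this abstract restricts each decoding step to a small span of encoder frames. A minimal sketch of that restriction is a softmax computed only over the scores inside a window. In the thesis the window shift and length are learned by the model; here they are plain arguments (`start`, `length`), and the function name is a hypothetical one chosen for illustration.

```python
import math


def windowed_attention(scores, start, length):
    """Softmax over scores[start:start+length]; frames outside the window get weight 0."""
    end = min(start + length, len(scores))
    if start < 0 or start >= end:
        raise ValueError("window must cover at least one frame")
    window = scores[start:end]
    peak = max(window)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in window]
    total = sum(exps)
    weights = [0.0] * len(scores)
    for offset, value in enumerate(exps):
        weights[start + offset] = value / total
    return weights


if __name__ == "__main__":
    scores = [0.1, 2.0, 0.3, 1.5, -0.4, 0.9]
    print(windowed_attention(scores, start=2, length=3))
```

Moving `start` forward from one decoding step to the next is what would give the strict left-to-right behaviour; how far it moves is the part the abstract says is learned.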

    Energy-Efficient Recurrent Neural Network Accelerators for Real-Time Inference

    Over the past decade, Deep Learning (DL) and Deep Neural Networks (DNNs) have developed rapidly. They are now widely applied to various applications and have profoundly changed the life of human beings. As an essential element of DNNs, Recurrent Neural Networks (RNNs) are helpful in processing time-sequential data and are widely used in applications such as speech recognition and machine translation. RNNs are difficult to compute because of their massive arithmetic operations and large memory footprint. RNN inference workloads used to be executed on conventional general-purpose processors including Central Processing Units (CPUs) and Graphics Processing Units (GPUs); however, these have hardware blocks that are unnecessary for RNN computation, such as branch predictors and caching systems, making them suboptimal for RNN processing. To accelerate RNN computations and outperform conventional processors, previous work focused on optimization methods on both software and hardware. On the software side, previous works mainly used model compression to reduce the memory footprint and the arithmetic operations of RNNs. On the hardware side, previous works also designed domain-specific hardware accelerators based on Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs) with customized hardware pipelines optimized for efficient processing of RNNs. By following this software-hardware co-design strategy, previous works achieved at least 10X speedup over conventional processors. Many previous works focused on achieving high throughput with a large batch of input streams. However, in real-time applications, such as gaming Artificial Intelligence (AI) and dynamical system control, low latency is more critical. Moreover, there is a trend of offloading neural network workloads to edge devices to provide a better user experience and privacy protection. 
Edge devices, such as mobile phones and wearable devices, are usually resource-constrained with a tight power budget. They require RNN hardware that is more energy-efficient to realize both low-latency inference and long battery life. Brain neurons have sparsity in both the spatial domain and the time domain. Inspired by this, previous work mainly explored model compression to induce spatial sparsity in RNNs. The delta network algorithm instead induces temporal sparsity in RNNs and, as previous works have shown, can save over 10X arithmetic operations. In this work, we have proposed customized hardware accelerators to exploit temporal sparsity in Gated Recurrent Unit (GRU)-RNNs and Long Short-Term Memory (LSTM)-RNNs to achieve energy-efficient real-time RNN inference. First, we have proposed DeltaRNN, the first-ever RNN accelerator to exploit temporal sparsity in GRU-RNNs. DeltaRNN has achieved 1.2 TOp/s effective throughput with a batch size of 1, which is 15X higher than its related works. Second, we have designed EdgeDRNN to accelerate GRU-RNN edge inference. Compared to DeltaRNN, EdgeDRNN does not rely on on-chip memory to store RNN weights and focuses on reducing off-chip Dynamic Random Access Memory (DRAM) data traffic using a more scalable architecture. EdgeDRNN has realized real-time inference of large GRU-RNNs with submillisecond latency and only 2.3 W wall plug power consumption, achieving 4X higher energy efficiency than commercial edge AI platforms like NVIDIA Jetson Nano. Third, we have used DeltaRNN to realize the first-ever continuous speech recognition system with the Dynamic Audio Sensor (DAS) as the front-end. The DAS is a neuromorphic event-driven sensor that produces a stream of asynchronous events instead of audio data sampled at a fixed sample rate. 
We have also showcased how an RNN accelerator can be integrated with an event-driven sensor on the same chip to realize ultra-low-power Keyword Spotting (KWS) on the extreme edge. Fourth, we have used EdgeDRNN to control a powered robotic prosthesis using an RNN controller to replace a conventional proportional-derivative (PD) controller. EdgeDRNN has achieved 21 μs latency of running the RNN controller and could maintain stable control of the prosthesis. These applications demonstrate the value of DeltaRNN and EdgeDRNN in solving real-world problems. Finally, we have applied the delta network algorithm on LSTM-RNNs and have combined it with a customized structured pruning method, called Column-Balanced Targeted Dropout (CBTD), to induce spatio-temporal sparsity in LSTM-RNNs. Then, we have proposed another FPGA-based accelerator called Spartus, the first RNN accelerator that exploits spatio-temporal sparsity. Spartus achieved 9.4 TOp/s effective throughput with a batch size of 1, the highest among present FPGA-based RNN accelerators with a power budget around 10 W. Spartus can complete the inference of an LSTM layer having 5 million parameters within 1 μs.
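The temporal sparsity that the delta network algorithm exploits can be sketched for a single matrix-vector product: an input element is propagated only when it has changed by more than a threshold since it was last propagated, and the product is updated incrementally. This is a simplified software illustration of the general idea, not the DeltaRNN, EdgeDRNN, or Spartus hardware design. The function name, the `macs` counter, and the zero initial state are assumptions made for this example.

```python
def delta_matvec_stream(weights, inputs, threshold):
    """Incrementally compute W @ x_t for a stream of inputs, skipping small changes.

    weights:   list of rows, shape (n_out, n_in)
    inputs:    list of input vectors x_t, each of length n_in
    threshold: an element is propagated only if it moved by more than this amount
    Returns (outputs, macs), where macs counts multiply-accumulates performed.
    """
    n_out, n_in = len(weights), len(weights[0])
    reference = [0.0] * n_in    # last propagated value of each input element
    accumulator = [0.0] * n_out  # running value of W @ reference
    outputs, macs = [], 0
    for x in inputs:
        for j in range(n_in):
            delta = x[j] - reference[j]
            if abs(delta) > threshold:
                reference[j] = x[j]
                for i in range(n_out):
                    accumulator[i] += weights[i][j] * delta
                macs += n_out  # one weight column was used
        outputs.append(list(accumulator))
    return outputs, macs


if __name__ == "__main__":
    weights = [[1.0, 2.0], [3.0, 4.0]]
    inputs = [[1.0, 1.0], [1.0, 1.0], [1.0, 2.0]]
    outputs, macs = delta_matvec_stream(weights, inputs, threshold=0.0)
    dense_macs = len(inputs) * len(weights) * len(weights[0])
    print(outputs, macs, dense_macs)
```

With a threshold of zero the outputs match the dense products while unchanged inputs cost nothing; a positive threshold trades a bounded approximation error for additional skipped columns.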