51 research outputs found

    On-device Efficient Acoustic Modeling with Simple Gated Convolutional Networks

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2019. 2. ์„ฑ์›์šฉ.์˜ค๋Š˜๋‚ , ์ž๋™ ์Œ์„ฑ ์ธ์‹ ์‹œ์Šคํ…œ์œผ๋กœ ์ธ๊ณต์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ฃผ์š”ํ•˜๊ฒŒ ํ™œ์šฉ๋˜๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฐ ๊ฐ€์šด๋ฐ, ์Šค๋งˆํŠธํฐ์ด๋‚˜ ์ž„๋ฒ ๋””๋“œ ์žฅ์น˜์—์„œ ์„œ๋ฒ„๋ฅผ ๊ฑฐ์น˜์ง€ ์•Š๊ณ  ์ง„ํ–‰๋˜๋Š” ์˜จ-๋””๋ฐ”์ด์Šค ์Œ์„ฑ ์ธ์‹ ์‹œ์Šคํ…œ์— ๋Œ€ํ•œ ์ˆ˜์š”๊ฐ€ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. ์˜จ-๋””๋ฐ”์ด์Šค ์Œ์„ฑ ์ธ์‹ ์‹œ์Šคํ…œ์€ ์‚ฌ์šฉ์ž์˜ ์Œ์„ฑ์ด ์„œ๋น„์Šค ์ œ๊ณต์ž์˜ ์„œ๋ฒ„๋กœ ์ œ๊ณต๋˜์ง€ ์•Š๊ณ , ์Œ์„ฑ์ธ์‹์ด ์‚ฌ์šฉ์ž์˜ ์žฅ์น˜์—์„œ ๋…๋ฆฝ์ ์œผ๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค. ๋”ฐ๋ผ์„œ, ํ”„๋ผ์ด๋ฒ„์‹œ ์นจํ•ด์™€ ๋ณด์•ˆ์— ๋Œ€ํ•œ ์šฐ๋ ค๋ฅผ ์ƒ๋‹น ๋ถ€๋ถ„ ํ•ด์†Œํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ์ธ๊ณต์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜์˜ ์Œ์„ฑ ์ธ์‹ ์‹œ์Šคํ…œ์—์„œ ์ฃผ๋กœ ์‚ฌ์šฉ๋˜๋Š” LSTM ๊ธฐ๋ฐ˜์˜ ํšŒ๊ท€์‹ ๊ฒฝ๋ง(RNN)์€ ์˜จ-๋””๋ฐ”์ด์Šค ์Œ์„ฑ ์ธ์‹์— ํšจ์œจ์ ์ด์ง€ ์•Š๋‹ค. LSTM RNN์€ ์‹œํ€€์Šค(sequence) ์ •๋ณด์˜ ๋ณ‘๋ ฌํ™”๊ฐ€ ์–ด๋ ต๋‹ค. ์ด๋Š” LSTM RNN์—๋Š” ํ˜„์žฌ์˜ ์‹œ๊ฐ„ ์Šคํ…(step)์ด ๊ณผ๊ฑฐ์˜ ์‹œ๊ฐ„ ์Šคํ…์— ์˜์กดํ•˜๋Š” ๋˜๋จน์ž„(Feedback) ํŠน์„ฑ์ด ์กด์žฌํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜, ์ด ๋˜๋จน์ž„ ์ •๋ณด๋Š” ๋„ˆ๋ฌด ์ปค์„œ ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ์— ๋“ค์–ด๊ฐˆ ์ˆ˜ ์—†๋‹ค. ๋”ฐ๋ผ์„œ, ์‹œํ€€์Šค ์ •๋ณด์˜ ๋งค ์‹œ๊ฐ„ ์Šคํ…๋งˆ๋‹ค DRAM์— ์ ‘๊ทผํ•˜์—ฌ ์ƒ˜ํ”Œ์„ ๋ถˆ๋Ÿฌ์™€์•ผ ํ•œ๋‹ค. ์ด ๊ฒฝ์šฐ ๋งค ์‹œ๊ฐ„ ์Šคํ…๋งˆ๋‹ค DRAM์— ์ ‘๊ทผํ•˜์—ฌ ์ „๋ ฅ์†Œ๋ชจ๊ฐ€ ์ฆ๊ฐ€ํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์‹คํ–‰ ์‹œ๊ฐ„๋„ ์ฆ๊ฐ€ํ•˜๊ฒŒ ๋œ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด ๋…ผ๋ฌธ์—์„œ ์˜จ-๋””๋ฐ”์ด์Šค์— ์นœํ™”์ ์ธ ์ธ๊ณต์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์„ ์ œ์‹œํ•œ๋‹ค. ์ด ๋ชจ๋ธ๋“ค์„ ์Œํ–ฅ ๋ชจ๋ธ๋ง์— ํ™œ์šฉํ•˜์—ฌ LSTM RNN์„ ๋Œ€์ฒดํ•œ๋‹ค. ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํฌ์›Œํฌ(Gated ConvNet), ๋Œ€๊ฐ์„ฑ๋ถ„ LSTM(Diagonal LSTM), QRNN(the quasi RNN)์ด ํ™œ์šฉ๋˜์—ˆ๋‹ค. ์ด๋“ค ๋ชจ๋ธ์€ ๋Œ€๋ถ€๋ถ„์˜ ์—ฐ์‚ฐ์—์„œ ์ˆœ์„œ ์˜์กด์„ฑ์ด ์กด์žฌ ํ•˜์ง€ ์•Š์•„ ์‹œ๊ฐ„ ์Šคํ…๋ณ„ ๋ณ‘๋ ฌํ™”๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค. \\ \\ \\ \\ ์ด๋“ค ๋ชจ๋ธ๋“ค์€ ์ž๋™ ์Œ์„ฑ ์ธ์‹์—์„œ 1์ฐจ์› ๊นŠ์ด ์ฝ˜๋ณผ๋ฃจ์…˜(1D depthwise Convolution)์ด ์ถ”๊ฐ€๋œ ํ›„์—๋Š” LSTM RNN์˜ ์„ฑ๋Šฅ์„ ํ›จ์”ฌ ๋Šฅ๊ฐ€ํ•˜์˜€๋‹ค. ํŠนํžˆ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ์˜ ๊ฒฝ์šฐ ๊นŠ์€ ๊ตฌ์กฐ๋ฅผ ์ฑ„ํƒํ•˜์˜€์„ ๋•Œ, ์Œํ–ฅ ๋ชจ๋ธ ์—†์ด ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ๋ฌด์—‡๋ณด๋‹ค๋„ ์˜จ-๋””๋ฐ”์ด์Šค์— ํšจ์œจ์ ์ธ ์ธ๊ณต์‹ ๊ฒฝ๋ง ๋ชจ๋ธ๋“ค์€ ์‹œํ€€์Šค์˜ ์‹œ๊ฐ„ ์Šคํ…๋ณ„ ๋ณ‘๋ ฌํ™”๋ฅผ ํ†ตํ•ด ์‹ค์ œ ์ž„๋ฒ ๋””๋“œ ์žฅ์น˜์—์„œ LSTM RNN ๋Œ€๋น„ ์ตœ์†Œ 5๋ฐฐ์˜ ์‹คํ–‰ ์†๋„ ์ฆ๊ฐ€๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์—ฌ๊ธฐ์„œ ๋” ๋‚˜์•„๊ฐ€, ์‹ฌํ”Œ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ(Simple Gated ConvNet)์„ ์ œ์‹œํ•œ๋‹ค. ์‹ฌํ”Œ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜์€ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜์˜ ๊ฐ€์žฅ ๋‹จ์ˆœํ™” ๋œ ํ˜•ํƒœ์— ๊ธฐ๋ฐ˜์„ ๋‘” ๊ฒƒ์œผ๋กœ, ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์ˆ˜๊ฐ€ ํ˜๋ช…์ ์œผ๋กœ ๊ฐ์†Œํ•œ๋‹ค. ์ด๋Š” ํ•˜๋“œ์›จ์–ด ์‚ฌ์–‘์˜ ์ œํ•œ์„ ๋ฐ›๋Š” ์˜จ-๋””๋ฐ”์ด์Šค ์Œ์„ฑ์ธ์‹์— ์œ ๋ฆฌํ•œ ํŠน์„ฑ์ด๋‹ค. ๋˜ํ•œ ์‹ฌํ”Œ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋Š” ์‹œ๊ฐ„ ์Šคํ… ๋ณ„ ์ˆœ์„œ ์˜์กด์„ฑ์ด ์กด์žฌํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ์‹œ๊ฐ„ ์Šคํ…๋ณ„ ๋ณ‘๋ ฌํ™”๋„ ๊ฐ€๋Šฅํ•˜๋‹ค. ์šฐ๋ฆฌ๋Š” 1์ฐจ์› ๊นŠ์ด ๋ณ‘๋ ฌํ™”(1D depthwise convolution)์„ ์—ฌ๋Ÿฌ ๋ฐฉํ–ฅ์„ ์ ์šฉํ•˜์—ฌ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์ด๋Œ์–ด ๋‚ด์—ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ, ์šฐ๋ฆฌ๋Š” ์‹ฌํ”Œ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํฌ์›Œํฌ๋ฅผ ํ™œ์šฉํ•ด ํŒŒ๋ผ๋ฏธํ„ฐ ์‚ฌ์šฉ๋Ÿ‰์„ 3 M ์ดํ•˜๋กœ ์ค„์˜€๋‹ค. ๋™์ผํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ์‹ฌํ”Œ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋Š” ์ž๋™ ์Œ์„ฑ ์ธ์‹์—์„œ LSTM RNN์ด๋‚˜ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ์˜ ์„ฑ๋Šฅ์„ ๋Šฅ๊ฐ€ํ–ˆ๋‹ค. 
3 M ์•„๋ž˜์˜ ์‹ฌํ”Œ ๊ฒŒ์ดํ‹ฐ๋“œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํฌ์›Œํฌ๋Š” 10 M์˜ LSTM๋ณด๋‹ค ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๊ธฐ๋„ ํ•˜์˜€๋‹ค. ๋˜ํ•œ, ์‹œ๊ฐ„ ์Šคํ… ๋ณ„ ๋ณ‘๋ ฌํ™”๋ฅผ ํ†ตํ•ด์„œ ARM CPU์—์„œ LSTM RNN ๋Œ€๋น„ 10 ๋ฐฐ์˜ ์‹คํ–‰ ์†๋„ ์ฆ๊ฐ€๋ฅผ ์–ป์–ด๋ƒˆ๋‹ค.Automatic speech recognition (ASR) is widely adopted for smartphones and many embedded devices in recent years, and neural network based algorithms show the best performance for ASR. While most of ASR systems are based on server-based processing, there is an increasing demand for on-device speech recognition because of privacy concern and low latency processing. Reducing the power consumption is especially important for on-device speech recognition to lengthen the battery life. Among several neural network models, recurrent neural network (RNN) based algorithms are mostly used for speech recognition, and long short-term memory(LSTM) RNN is most popular because of its superior performance over the other ones. However, executing LSTM RNN demands many DRAM accesses because the cache size of embedded devices is usually much smaller than the parameter size of RNN. Multi-time step parallelization technique computes multiple output samples at a time by fetching one set of parameters, and thus it can reduce the number of DRAM accesses in proportional to the number of time steps computed at a time. However, LSTM RNN does not permit the multi-time step parallelization because of complex feedback structure of the model. This thesis presents neural network models that support efficient on-device speech recognition. First, a few models that permit multi-time step parallel processing are evaluated. The models evaluated include Gated ConvNet, Diagonal LSTM, and QRNN (quasi RNN). Since the performance of these models are not as good as the LSTM, one-dimensional depthwise convolution is added to improve the performance. The one-dimensional convolution helps finding the temporal patterns of speech signal. Second, Simple Gated Convolution Network (Simple Gated ConvNet) is proposed for improved performance when the parameter count is very small. The Simple Gated ConvNet employs the simplest form of Gated ConvNet. Instead it relies on one-dimensional convolution for temporal observation. Simple Gated ConvNet supports low-power on-device speech recognition because it can be executed employing multi-time step parallelization. The Simple Gated ConvNet under 3 million even shows better performance than the LSTM with 10 million parameters. In addition, the execution speed in ARM CPU can be increased more than ten-times compared with the LSTM RNN through multi-time step parallelization.1 Introduction 1 1.1 On-device speech recognition: advantages and challenges . . . . . . 1 1.2 The components of speech recognition . . . . . . . . . . . . . . . . 3 1.3 The downsides of RNN based acoustic models . . . . . . . . . . . . . 4 1.4 Exploration of efficient on-device acoustic modeling with neural networks . . . . . . . . . . 5 1.5 Simple Gated ConvNet for small footprint acoustic modeling . . . . . 6 1.6 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Exploration of Efficient On-device Acoustic Modeling with Neural Networks 8 2.1 Acoustic Modeling Algorithms . . . . . . . . . . . . . . . . . . . . 8 2.1.1 Diagonal LSTM RNN . . . . . . . . . . . . . . . . . . . . . 8 2.1.2 QRNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.3 Gated ConvNet . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Experiments . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 10 2.2.1 End-to-end speech recognition . . . . . . . . . . . . . . . . . 11 2.2.2 Phoneme classification . . . . . . . . . . . . . . . . . . . . . 15 2.2.3 Implementation Results on Embedded Systems . . . . . . . . 17 2.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 Simple Gated Convolutional Networks for small footprint acoustic modeling 20 3.1 Simple Gated ConvNet . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.1.1 Gated ConvNet . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.1.2 Simple Gated ConvNet . . . . . . . . . . . . . . . . . . . . . 21 3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.1 Experiment Setups . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4 Conclusions 32 Abstract (In Korean) 39Maste
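
    The gating-plus-depthwise-convolution idea above can be made concrete with a short sketch. The following is a minimal PyTorch illustration of a gated 1-D convolutional block, not the thesis implementation; the channel count, kernel width, and layer arrangement are assumptions for illustration.

    import torch
    import torch.nn as nn

    class GatedConv1dBlock(nn.Module):
        def __init__(self, channels: int, kernel_size: int = 5):
            super().__init__()
            # Depthwise convolution: one small filter per channel, which adds
            # few parameters but captures local temporal patterns of speech.
            self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                       padding=kernel_size // 2, groups=channels)
            # Pointwise (1x1) convolutions produce the linear path and the gate.
            self.value = nn.Conv1d(channels, channels, 1)
            self.gate = nn.Conv1d(channels, channels, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time). There is no feedback between time
            # steps, so all steps are computed in parallel from one parameter
            # fetch, unlike an LSTM.
            h = self.depthwise(x)
            return self.value(h) * torch.sigmoid(self.gate(h))

    x = torch.randn(1, 64, 100)   # (batch, feature channels, time steps)
    y = GatedConv1dBlock(64)(x)   # same shape; fully parallel over time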

    End-to-End Neural Network-based Speech Recognition for Mobile and Embedded Devices

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. ์„ฑ์›์šฉ.

    Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest in recent years. Deep neural network-based automatic speech recognition demands a large number of computations, while the memory bandwidth and battery capacity of mobile devices are limited. A server-based implementation is often employed instead, but it increases latency and raises privacy concerns. Therefore, the need for an on-device ASR system is increasing. Recurrent neural networks (RNNs) are often used for the ASR model. An RNN implementation on embedded devices can suffer from excessive DRAM accesses, because the parameter size of a neural network usually exceeds that of the cache memory. Also, the parameters of an RNN cannot be reused across multiple time steps due to its feedback structure. To solve this problem, multi-time step parallelizable models are applied to speech recognition. The multi-time step parallelization approach computes multiple output samples at a time with the parameters fetched from DRAM. Since the number of DRAM accesses can be reduced in proportion to the number of parallelization steps, a high processing speed can be achieved for a parallelizable model. In this thesis, a connectionist temporal classification (CTC) model is constructed by combining simple recurrent units (SRUs) and depthwise 1-dimensional convolution layers for multi-time step parallelization. Both character and word piece models are developed for the CTC model, and the corresponding RNN based language models are used for beam search decoding. A competitive WER on the WSJ corpus is achieved with a total model size of approximately 15 MB, and the system operates at real-time speed using only a single ARM CPU core without a GPU or special hardware. A low-latency on-device speech recognition system with a simple gated convolutional network (SGCN) is also proposed. The SGCN shows competitive recognition accuracy even with 1M parameters, and 8-bit quantization is applied to reduce the memory size and computation time. The proposed system features online recognition with a 0.4 s latency limit and operates at 0.2 RTF with only a single 900 MHz CPU core. In addition, an attention-based model with a depthwise convolutional encoder is proposed. Convolutional encoders enable faster training and inference of attention models than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field often suffers from looping or skipping problems, which we attribute to the time-invariance of convolutions. We attempt to remedy this issue by adding positional information to the convolution-based encoder, and show that the word error rate (WER) of a convolutional encoder with a short receptive field can be reduced significantly by augmenting it with positional information. Visualization results are presented to demonstrate the effectiveness of incorporating positional information. A streaming end-to-end ASR model is also developed by applying monotonic chunkwise attention.
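
    As a sketch of why an SRU-based CTC model parallelizes across time: all heavy matrix multiplications can be applied to the whole sequence at once, and only a cheap element-wise recurrence remains sequential. The following minimal PyTorch sketch illustrates this structure under assumed layer sizes and a simplified gate formulation; it is not the thesis implementation.

    import torch
    import torch.nn as nn

    class SRU(nn.Module):
        def __init__(self, size: int):
            super().__init__()
            # One weight matrix yields the candidate and both gates for every
            # time step in a single multiplication.
            self.W = nn.Linear(size, 3 * size)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (time, batch, size). The matmul below covers all time steps,
            # so the parameters are fetched from memory once per sequence.
            z, f, r = self.W(x).chunk(3, dim=-1)
            f, r = torch.sigmoid(f), torch.sigmoid(r)
            c, out = torch.zeros_like(x[0]), []
            for t in range(x.size(0)):  # only element-wise work is sequential
                c = f[t] * c + (1 - f[t]) * z[t]
                out.append(r[t] * torch.tanh(c) + (1 - r[t]) * x[t])
            return torch.stack(out)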

    ํšจ์œจ์ ์ธ ํ‚ค์›Œ๋“œ ์ธ์‹์„ ์œ„ํ•œ ๊ฐ„๋žต ์ฝ˜๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์„ฑ์›์šฉ.ํ‚ค์›Œ๋“œ ์ŠคํŒŸํŒ…(KWS)์€ ํ˜„์žฌ์˜ ์Œ์„ฑ ๊ธฐ๋ฐ˜ ํœด๋จผ-์ปดํ“จํ„ฐ ์ƒํ˜ธ์ž‘์šฉ์—์„œ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•˜๋ฉฐ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋‹ค. ์‹ ๊ฒฝ๋ง์˜ ๊ธ‰์†ํ•œ ๋ฐœ๋‹ฌ๋กœ ์Œ์„ฑ์ธ์‹, ์Œ์„ฑ ํ•ฉ์„ฑ, ํ™”์ž์ธ์‹ ๋“ฑ ์—ฌ๋Ÿฌ ์Œ์„ฑ ์ฒ˜๋ฆฌ ๋ถ„์•ผ์— ๊ฑธ์นœ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ํฐ ์„ฑ๊ณผ๋ฅผ ๊ฑฐ๋’€๋‹ค. ๋‹ค์–‘ํ•œ ์Œ์„ฑ ์ฒ˜๋ฆฌ ๋ถ„์•ผ์—์„œ ๊ฐ•์ ์„ ๋ณด์ด๊ณ  ์žˆ๋Š” ์ธ๊ณต ์‹ ๊ฒฝ๋ง์€ KWS๋ฅผ ์œ„ํ•œ ์‹œ์Šคํ…œ์—๋„ ๋งค๋ ฅ์ ์ธ ์„ ํƒ์ด ๋˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ™˜๊ฒฝ์€ ์Šค๋งˆํŠธํฐ, ํŒจ๋“œ ๋ฐ ์ผ๋ถ€ ์Šค๋งˆํŠธ ํ™ˆ ๊ธฐ๊ธฐ๋ฅผ ํฌํ•จํ•œ ์†Œํ˜• ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ๋“ค์ด ๋Œ€๋ถ€๋ถ„์ด๊ธฐ ๋•Œ๋ฌธ์—, ์‹ ๊ฒฝ ๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜๋“ค์€ KWS ์‹œ์Šคํ…œ์„ ์„ค๊ณ„ํ•  ๋•Œ ์ด๋Ÿฌํ•œ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์˜ ์ œํ•œ๋œ ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ„์‚ฐ ์šฉ๋Ÿ‰์„ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค. ๋™์‹œ์— ์‹ค์‹œ๊ฐ„, ์‚ฌ์šฉ์ž ์นœํ™”์ , ๋†’์€ ์ •ํ™•๋„๋กœ ๋Œ€์‘ํ•˜๋ ค๋ฉด ๋‚ฎ์€ ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๋˜ํ•œ KWS๋Š” ๋‹ค๋ฅธ ์—…๋ฌด์™€ ๋‹ฌ๋ผ ์ƒ์‹œ ์˜จ๋ผ์ธ ์ƒํƒœ์—์„œ ์ด์šฉ์ž์˜ ํ˜ธ์ถœ์„ ๊ธฐ๋‹ค๋ ค์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— KWS ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ์ „๋ ฅ ์˜ˆ์‚ฐ๋„ ํฌ๊ฒŒ ์ œํ•œ๋œ๋‹ค. ๋ฉ”์ธ์ŠคํŠธ๋ฆผ ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ ์ค‘์—๋Š” ๊ณผ๊ฑฐ DNN, CNN, RNN, ๊ทธ๋ฆฌ๊ณ  ์„œ๋กœ์˜ ์กฐํ•ฉ์ด ์ฃผ๋กœ KWS์— ์‚ฌ์šฉ๋˜๋ฉด์„œ ์ตœ๊ทผ์—๋Š” Attention ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋„ ์ ์  ์ธ๊ธฐ๋ฅผ ๋Œ๊ณ  ์žˆ๋‹ค. ๊ทธ ์ค‘์—์„œ๋„ CNN์€ ์ •ํ™•์„ฑ๊ณผ ๊ฒฌ๊ณ ์„ฑ, ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๊ฐ€ ๋›ฐ์–ด๋‚˜ KWS์—์„œ ๋„๋ฆฌ ์ฑ„ํƒ๋˜๊ณ  ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ํšจ์œจ์ ์ธ ํ‚ค์›Œ๋“œ ์ŠคํŒŸํŒ…์„ ์ง€์›ํ•˜๋Š” ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์ธ ์‹ ํ”Œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์ œ์‹œํ•œ๋‹ค. ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•œ ์ค‘๊ฐ„ ๊ณผ์ •์œผ๋กœ ๋ณด๋‹ค ์ปดํŒฉํŠธํ•œ residual ๋„คํŠธ์›Œํฌ์™€ ๋…ธ์ด์ฆˆ ์ธ์‹ ํ›ˆ๋ จ๋ฒ•์„ ์ฃผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค. ResNet์€ ์ข‹์€ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด ํ•ญ์ƒ ์ˆ˜์‹ญ๋งŒ ๊ฐœ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ํ•„์š”๋กœ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์šฐ๋ฆฌ ๋ชจ๋ธ์—์„œ๋Š” ํ•œ์ •๋œ ์ž์›์„ ๊ฐ€์ง„ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์— ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ๋„๋ก depthwise ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ์‹ค์ œ ๋ชจ๋ฐ”์ผ ๊ธฐ๊ธฐ์ธ ์‚ผ์„ฑ ๊ฐค๋Ÿญ์‹œ S6 ์—ฃ์ง€์—์„œ ์ œ์•ˆ๋œ ๋ชจ๋ธ์˜ ์‹ค์ œ ์ถ”๋ก  ์‹œ๊ฐ„(์ฆ‰, ์ง€์—ฐ ์‹œ๊ฐ„)์„ ์ธก์ •ํ•˜์˜€๋‹ค. ์˜จ๋ผ์ธ ์ƒ ๊ณต๊ฐœ๋œ Google ์Œ์„ฑ ๋ช…๋ น ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์ด ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” ์ œ์‹œ๋œ ๋ชจ๋ธ์ด ๊ธฐ์กด ๋ชจ๋ธ๋ณด๋‹ค ์•ฝ 1/2 ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜์™€ ๊ณ„์‚ฐ ํšŸ์ˆ˜๋ฅผ ํ›จ์”ฌ ์ ๊ฒŒ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ๊ฑฐ์˜ ๋™์ผํ•œ ์ •ํ™•๋„๋กœ ์†๋„๊ฐ€ 17.5 % ๋น ๋ฅด๋ฉฐ 6.9ms์— ๋„๋‹ฌํ–ˆ๋‹ค. ํ›จ์”ฌ ์ž‘์€ ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋ชจ๋กœ๋„ ๋‹ค๋ฅธ ์ตœ์‹  KWS ๋ชจ๋ธ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” 96.59%์˜ ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๊ณ  ์žˆ๋‹ค.Keyword spotting (KWS) plays an important role in the current speech-based human-computer interaction, and is widely used on smart devices. With the rapid development of neural networks, various applications in speech related fields such as speech recognition, speech synthesis and speaker recognition have achieved great performances. Neural networks have become attractive choices for KWS architectures because of their good performance in speech processing. However, since the application environment is mostly in small smart devices including smart phones, tablets and smart home devices, neural network architectures must consider the limited memory and computation capacity of these smart devices when designing a KWS system . 
At the same time, the KWS system should be able to maintain low latency in order to respond in real time. In addition, KWS is different from other tasks, because it needs to be always online and waiting for the call from the users, therefore, the power budget of the KWS application is also greatly restricted. Among the mainstream neural network models, FCDNN (fully connected deep neural network), CNN (convolutional neural network), RNN (recurrent neural network) and the combination of them are mainly used for KWS in the past. Recently, attention-based models have become more and more popular. Among them, CNN is widely adopted in KWS, because of its excellent accuracy, robustness, and parallel processing capacity. Parallel processing capacity is essential for low-power implementations. In this work, we present a neural network model-Simple Depthwise Convolutional Network, which supports an efficient keyword spotting. We mainly focus on a more compact Residual Network, and apply noise injection as an intermediate process to maintain high accuracy. Typically, ResNet always requires several hundred thousands parameters to achieve good performance. In our model, we employ depthwise convolutional neural networks to decrease the number of parameters, so that it can be more suitable for smart devices with limited resources. Finally, our model is tested on a real mobile device Samsung Galaxy S6 Edge, reality in the real inference time (that is, latency) of about 6.9ms, which is 17.5% faster than the state-of-the-art model TC-ResNet. The publicly available Google Speech Commands dataset is used to evaluate the models. The results show that we only use about one half of the parameters and at most 300 times fewer number of computations than the original base model, meanwhile, much smaller memory footprint yet maintain the 96.59% comparable high accuracy which outperforms the other state-of-the-art KWS models.1. Introduction 1 1.1 Keyword Spotting System (KWS) 1 1.2 Challenges in Keyword Spotting 6 1.3 Neural Network Architecture for Small-Footprint KWS 6 1.3.1 TDNN-SWSA 7 1.3.2 TC-ResNet 9 1.3.3 DS-CNN 9 1.4 Simple Depthwise Convolutional Neural Network for Efficient KWS 10 1.5 Outline of the Thesis 11 2.Simple Depthwise Convolutional Neural Network 12 2.1 Depthwise ConvNet 12 2.2 Simple Depthwise ConvNet 14 2.3 Residual Simple Depthwise ConvNet 15 2.4 Experiments and Results 17 3. Robustness of Efficient Keyword Spotting 19 3.1 Weight Noise Injection 19 3.2 Experiments on Two Different GSCs 21 3.2.1 Standard GSC 21 3.2.2 Augmented GSC 22 3.2.3 Experiments and Results 22 3.3 FRR and FAR in a 3rd Dataset 24 3.3.1 FRR and FAR 24 3.3.2 The third GSC 24 3.3.3 Experiments and Results 25 4. Conclusions 28 5. Bibliography 29 Abstract (in Korean) 32Maste
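
    The parameter saving from the depthwise trick described above can be checked with a few lines of arithmetic. The channel and kernel sizes below are assumptions for illustration, not the configuration used in the thesis.

    # Parameter counts (ignoring biases) for a standard 1-D convolution versus
    # a depthwise separable one at the same channel width and kernel size.
    def conv1d_params(c_in: int, c_out: int, k: int) -> int:
        return c_in * c_out * k          # every filter spans all input channels

    def separable_params(c_in: int, c_out: int, k: int) -> int:
        depthwise = c_in * k             # one k-tap filter per input channel
        pointwise = c_in * c_out         # 1x1 convolution mixes the channels
        return depthwise + pointwise

    c, k = 64, 9
    print(conv1d_params(c, c, k))        # 36864
    print(separable_params(c, c, k))     # 4672, roughly 8x fewer parameters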

    MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition

    Full text link
    We present MatchboxNet, an end-to-end neural network for speech command recognition. MatchboxNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers. MatchboxNet reaches state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models. The small footprint of MatchboxNet makes it an attractive candidate for devices with limited computational resources. The model is highly scalable, so model accuracy can be improved with modest additional memory and compute. Finally, we show how intensive data augmentation using an auxiliary noise dataset improves robustness in the presence of background noise.
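
    A sub-block of the kind the abstract describes can be sketched briefly. The following is a hedged PyTorch illustration of a 1D time-channel separable block (separable convolution, batch norm, ReLU, dropout, with a residual connection) under assumed sizes; it is not the authors' code.

    import torch
    import torch.nn as nn

    class TimeChannelSeparableBlock(nn.Module):
        def __init__(self, channels: int, kernel_size: int = 9, dropout: float = 0.2):
            super().__init__()
            self.block = nn.Sequential(
                # "time" part: depthwise convolution along the time axis
                # (odd kernel size keeps the sequence length unchanged)
                nn.Conv1d(channels, channels, kernel_size,
                          padding=kernel_size // 2, groups=channels),
                # "channel" part: pointwise convolution mixing channels
                nn.Conv1d(channels, channels, 1),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.Dropout(dropout),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time); residual connection as in a deep
            # residual network.
            return x + self.block(x)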

    EASTER: Efficient and Scalable Text Recognizer

    Full text link
    Recent progress in deep learning has led to the development of Optical Character Recognition (OCR) systems which perform remarkably well. Most research has centered on recurrent networks as well as complex gated layers, which make the overall solution complex and difficult to scale. In this paper, we present an Efficient And Scalable TExt Recognizer (EASTER) to perform optical character recognition on both machine-printed and handwritten text. Our model utilises 1-D convolutional layers without any recurrence, which enables parallel training with a considerably smaller volume of data. We experimented with multiple variations of our architecture, and one of the smallest variants (in terms of depth and parameter count) performs comparably to complex RNN-based alternatives. Our 20-layered deepest variant outperforms RNN architectures by a good margin on benchmarking datasets like IIIT-5k and SVT. We also showcase improvements over the current best results on the offline handwritten text recognition task, and we present data generation pipelines with an augmentation setup to generate synthetic datasets for both handwritten and machine-printed text.

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)

    Get PDF

    COMPUTATIONAL ANALYSIS OF CODE-MULTIPLEXED COULTER SENSOR SIGNALS

    Get PDF
    Nowadays, lab-on-a-chip (LoC) technology has been applied in a variety of applications because of its capability to perform accurate microscale manipulations of cells for point-of-care diagnostics. However, the results of such manipulations are not readily available from the LoC device itself and typically still require a post-inspection of the chip using traditional laboratory equipment such as a microscope, negating the advantages of the LoC technology. To solve this dilemma, my doctoral research mainly focuses on developing portable and disposable biosensors for interfacing with and digitizing the information from an LoC system. Our sensor platform, integrated with multiple microfluidic impedance sensors, electrically monitors and tracks manipulated cells on an LoC device. The platform compresses the information from all of its sensors into a single 1-dimensional electrical waveform, and therefore further signal processing is required to recover the readout of each sensor and extract information about the detected cells. Building on this capability, we have introduced integrated microfluidic cytometers to characterize properties of cells such as cell surface expression and mechanical properties.
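
    The demultiplexing step described above can be sketched in a few lines. The following NumPy illustration assumes each sensor is tagged with an orthogonal code and that readouts are recovered by correlation; the codes, signal model, and decoding method here are illustrative assumptions, not the dissertation's actual processing.

    import numpy as np

    codes = np.array([[1,  1,  1,  1],     # hypothetical orthogonal (Walsh)
                      [1, -1,  1, -1],     # codes, one per Coulter sensor
                      [1,  1, -1, -1]])

    amplitudes = np.array([0.0, 2.0, 0.5])           # cell signal at each sensor
    waveform = amplitudes @ codes                     # superposed multiplexed output
    waveform = waveform + 0.05 * np.random.randn(codes.shape[1])  # noise

    # Correlate the shared waveform against each code to demultiplex it.
    recovered = waveform @ codes.T / codes.shape[1]
    print(recovered)   # approximately [0.0, 2.0, 0.5]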
    • โ€ฆ
    corecore