
    ํšจ์œจ์ ์ธ ํ‚ค์›Œ๋“œ ์ธ์‹์„ ์œ„ํ•œ ๊ฐ„๋žต ์ฝ˜๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์„ฑ์›์šฉ.ํ‚ค์›Œ๋“œ ์ŠคํŒŸํŒ…(KWS)์€ ํ˜„์žฌ์˜ ์Œ์„ฑ ๊ธฐ๋ฐ˜ ํœด๋จผ-์ปดํ“จํ„ฐ ์ƒํ˜ธ์ž‘์šฉ์—์„œ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•˜๋ฉฐ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋‹ค. ์‹ ๊ฒฝ๋ง์˜ ๊ธ‰์†ํ•œ ๋ฐœ๋‹ฌ๋กœ ์Œ์„ฑ์ธ์‹, ์Œ์„ฑ ํ•ฉ์„ฑ, ํ™”์ž์ธ์‹ ๋“ฑ ์—ฌ๋Ÿฌ ์Œ์„ฑ ์ฒ˜๋ฆฌ ๋ถ„์•ผ์— ๊ฑธ์นœ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ํฐ ์„ฑ๊ณผ๋ฅผ ๊ฑฐ๋’€๋‹ค. ๋‹ค์–‘ํ•œ ์Œ์„ฑ ์ฒ˜๋ฆฌ ๋ถ„์•ผ์—์„œ ๊ฐ•์ ์„ ๋ณด์ด๊ณ  ์žˆ๋Š” ์ธ๊ณต ์‹ ๊ฒฝ๋ง์€ KWS๋ฅผ ์œ„ํ•œ ์‹œ์Šคํ…œ์—๋„ ๋งค๋ ฅ์ ์ธ ์„ ํƒ์ด ๋˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ™˜๊ฒฝ์€ ์Šค๋งˆํŠธํฐ, ํŒจ๋“œ ๋ฐ ์ผ๋ถ€ ์Šค๋งˆํŠธ ํ™ˆ ๊ธฐ๊ธฐ๋ฅผ ํฌํ•จํ•œ ์†Œํ˜• ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ๋“ค์ด ๋Œ€๋ถ€๋ถ„์ด๊ธฐ ๋•Œ๋ฌธ์—, ์‹ ๊ฒฝ ๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜๋“ค์€ KWS ์‹œ์Šคํ…œ์„ ์„ค๊ณ„ํ•  ๋•Œ ์ด๋Ÿฌํ•œ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์˜ ์ œํ•œ๋œ ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ„์‚ฐ ์šฉ๋Ÿ‰์„ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค. ๋™์‹œ์— ์‹ค์‹œ๊ฐ„, ์‚ฌ์šฉ์ž ์นœํ™”์ , ๋†’์€ ์ •ํ™•๋„๋กœ ๋Œ€์‘ํ•˜๋ ค๋ฉด ๋‚ฎ์€ ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๋˜ํ•œ KWS๋Š” ๋‹ค๋ฅธ ์—…๋ฌด์™€ ๋‹ฌ๋ผ ์ƒ์‹œ ์˜จ๋ผ์ธ ์ƒํƒœ์—์„œ ์ด์šฉ์ž์˜ ํ˜ธ์ถœ์„ ๊ธฐ๋‹ค๋ ค์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— KWS ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ์ „๋ ฅ ์˜ˆ์‚ฐ๋„ ํฌ๊ฒŒ ์ œํ•œ๋œ๋‹ค. ๋ฉ”์ธ์ŠคํŠธ๋ฆผ ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ ์ค‘์—๋Š” ๊ณผ๊ฑฐ DNN, CNN, RNN, ๊ทธ๋ฆฌ๊ณ  ์„œ๋กœ์˜ ์กฐํ•ฉ์ด ์ฃผ๋กœ KWS์— ์‚ฌ์šฉ๋˜๋ฉด์„œ ์ตœ๊ทผ์—๋Š” Attention ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋„ ์ ์  ์ธ๊ธฐ๋ฅผ ๋Œ๊ณ  ์žˆ๋‹ค. ๊ทธ ์ค‘์—์„œ๋„ CNN์€ ์ •ํ™•์„ฑ๊ณผ ๊ฒฌ๊ณ ์„ฑ, ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๊ฐ€ ๋›ฐ์–ด๋‚˜ KWS์—์„œ ๋„๋ฆฌ ์ฑ„ํƒ๋˜๊ณ  ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ํšจ์œจ์ ์ธ ํ‚ค์›Œ๋“œ ์ŠคํŒŸํŒ…์„ ์ง€์›ํ•˜๋Š” ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์ธ ์‹ ํ”Œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์ œ์‹œํ•œ๋‹ค. ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•œ ์ค‘๊ฐ„ ๊ณผ์ •์œผ๋กœ ๋ณด๋‹ค ์ปดํŒฉํŠธํ•œ residual ๋„คํŠธ์›Œํฌ์™€ ๋…ธ์ด์ฆˆ ์ธ์‹ ํ›ˆ๋ จ๋ฒ•์„ ์ฃผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค. 
ResNet์€ ์ข‹์€ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด ํ•ญ์ƒ ์ˆ˜์‹ญ๋งŒ ๊ฐœ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ํ•„์š”๋กœ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์šฐ๋ฆฌ ๋ชจ๋ธ์—์„œ๋Š” ํ•œ์ •๋œ ์ž์›์„ ๊ฐ€์ง„ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์— ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ๋„๋ก depthwise ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ์‹ค์ œ ๋ชจ๋ฐ”์ผ ๊ธฐ๊ธฐ์ธ ์‚ผ์„ฑ ๊ฐค๋Ÿญ์‹œ S6 ์—ฃ์ง€์—์„œ ์ œ์•ˆ๋œ ๋ชจ๋ธ์˜ ์‹ค์ œ ์ถ”๋ก  ์‹œ๊ฐ„(์ฆ‰, ์ง€์—ฐ ์‹œ๊ฐ„)์„ ์ธก์ •ํ•˜์˜€๋‹ค. ์˜จ๋ผ์ธ ์ƒ ๊ณต๊ฐœ๋œ Google ์Œ์„ฑ ๋ช…๋ น ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์ด ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” ์ œ์‹œ๋œ ๋ชจ๋ธ์ด ๊ธฐ์กด ๋ชจ๋ธ๋ณด๋‹ค ์•ฝ 1/2 ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜์™€ ๊ณ„์‚ฐ ํšŸ์ˆ˜๋ฅผ ํ›จ์”ฌ ์ ๊ฒŒ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ๊ฑฐ์˜ ๋™์ผํ•œ ์ •ํ™•๋„๋กœ ์†๋„๊ฐ€ 17.5 % ๋น ๋ฅด๋ฉฐ 6.9ms์— ๋„๋‹ฌํ–ˆ๋‹ค. ํ›จ์”ฌ ์ž‘์€ ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋ชจ๋กœ๋„ ๋‹ค๋ฅธ ์ตœ์‹  KWS ๋ชจ๋ธ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” 96.59%์˜ ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๊ณ  ์žˆ๋‹ค.Keyword spotting (KWS) plays an important role in the current speech-based human-computer interaction, and is widely used on smart devices. With the rapid development of neural networks, various applications in speech related fields such as speech recognition, speech synthesis and speaker recognition have achieved great performances. Neural networks have become attractive choices for KWS architectures because of their good performance in speech processing. However, since the application environment is mostly in small smart devices including smart phones, tablets and smart home devices, neural network architectures must consider the limited memory and computation capacity of these smart devices when designing a KWS system . At the same time, the KWS system should be able to maintain low latency in order to respond in real time. In addition, KWS is different from other tasks, because it needs to be always online and waiting for the call from the users, therefore, the power budget of the KWS application is also greatly restricted. 
Among mainstream neural network models, FCDNNs (fully connected deep neural networks), CNNs (convolutional neural networks), RNNs (recurrent neural networks), and combinations of them have mainly been used for KWS in the past; recently, attention-based models have become increasingly popular. Among these, CNNs are widely adopted in KWS because of their excellent accuracy, robustness, and parallel-processing capacity, the last of which is essential for low-power implementations. In this work, we present a neural network model, the Simple Depthwise Convolutional Network, which supports efficient keyword spotting. We focus on a more compact residual network and apply noise injection as an intermediate training step to maintain high accuracy. ResNet typically requires several hundred thousand parameters to achieve good performance; in our model, we employ depthwise convolutional neural networks to decrease the number of parameters, making it more suitable for smart devices with limited resources. Finally, our model is tested on a real mobile device, the Samsung Galaxy S6 Edge, achieving a real inference time (i.e., latency) of about 6.9 ms, which is 17.5% faster than the state-of-the-art TC-ResNet model. The publicly available Google Speech Commands dataset is used to evaluate the models. The results show that our model uses only about half the parameters and up to 300 times fewer computations than the original base model, with a much smaller memory footprint, while maintaining a comparable high accuracy of 96.59% that outperforms other state-of-the-art KWS models.
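The parameter savings the abstract attributes to depthwise convolution can be illustrated with a minimal sketch; the kernel size and channel counts below are hypothetical examples, not values taken from the thesis:

```python
def standard_conv_params(k, c_in, c_out):
    # a standard k x k convolution learns one k*k*c_in filter per output channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel (no channel mixing),
    # followed by a 1x1 pointwise convolution that mixes channels
    return k * k * c_in + c_in * c_out

# hypothetical layer: 3x3 kernel, 64 input channels, 64 output channels
std = standard_conv_params(3, 64, 64)        # 3*3*64*64 = 36864 parameters
sep = depthwise_separable_params(3, 64, 64)  # 576 + 4096  = 4672 parameters
print(f"standard: {std}, depthwise separable: {sep} ({std / sep:.1f}x fewer)")
```

The gap grows with channel count, which is why the depthwise substitution shrinks the model without touching the receptive field.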
1. Introduction
  1.1 Keyword Spotting System (KWS)
  1.2 Challenges in Keyword Spotting
  1.3 Neural Network Architecture for Small-Footprint KWS
    1.3.1 TDNN-SWSA
    1.3.2 TC-ResNet
    1.3.3 DS-CNN
  1.4 Simple Depthwise Convolutional Neural Network for Efficient KWS
  1.5 Outline of the Thesis
2. Simple Depthwise Convolutional Neural Network
  2.1 Depthwise ConvNet
  2.2 Simple Depthwise ConvNet
  2.3 Residual Simple Depthwise ConvNet
  2.4 Experiments and Results
3. Robustness of Efficient Keyword Spotting
  3.1 Weight Noise Injection
  3.2 Experiments on Two Different GSCs
    3.2.1 Standard GSC
    3.2.2 Augmented GSC
    3.2.3 Experiments and Results
  3.3 FRR and FAR in a 3rd Dataset
    3.3.1 FRR and FAR
    3.3.2 The Third GSC
    3.3.3 Experiments and Results
4. Conclusions
5. Bibliography
Abstract (in Korean)
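The weight noise injection of Section 3.1 can be sketched as a simple zero-mean Gaussian perturbation of the weights applied during training; the sigma value and list-based weights here are illustrative assumptions, not the thesis's exact procedure:

```python
import random

def inject_weight_noise(weights, sigma=0.05, rng=None):
    # perturb each weight with zero-mean Gaussian noise before the forward
    # pass; training under such perturbations encourages solutions that stay
    # accurate when weights are disturbed (e.g., by quantization on devices)
    rng = rng or random.Random(0)
    return [w + rng.gauss(0.0, sigma) for w in weights]

weights = [0.5, -0.25, 0.125]
noisy = inject_weight_noise(weights)
# the clean weights are kept; the noisy copy is used only for this pass
assert len(noisy) == len(weights)
```

In training, the perturbed copy would feed the forward and backward pass while the optimizer updates the clean weights.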

    LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting

    This paper proposes a hardware-efficient architecture, the Linearized Convolution Network (LiCo-Net), for keyword spotting. It is optimized specifically for low-power processors such as microcontrollers, where ML operators exhibit heterogeneous efficiency profiles: for the same theoretical computation cost, int8 operators are more computation-effective than float operators, and linear layers are often more efficient than other layers. LiCo-Net is a dual-phase system that uses efficient int8 linear operators at the inference phase and applies streaming convolutions at the training phase to maintain high model capacity. The experimental results show that LiCo-Net outperforms the singular value decomposition filter (SVDF) on hardware efficiency with on-par detection performance, reducing cycles by 40% on the HiFi4 DSP compared to SVDF.
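The core idea of linearizing a convolution — expressing each output as a dot product over a sliding window so the whole operation becomes a plain linear (matrix-style) operator that can run on efficient matmul kernels — can be sketched as follows. This is a toy float illustration under assumed 1-D shapes, not LiCo-Net's actual int8 operator:

```python
def conv1d(x, w):
    # direct 1-D convolution with valid padding
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

def conv1d_as_linear(x, w):
    # the same convolution as a linear operator: gather sliding windows
    # (im2col), then take one dot product per window, i.e. a matrix-vector
    # product between the window matrix and the kernel
    k = len(w)
    windows = [x[i:i + k] for i in range(len(x) - k + 1)]
    return [sum(a * b for a, b in zip(row, w)) for row in windows]

x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
assert conv1d(x, w) == conv1d_as_linear(x, w)  # identical outputs
```

Once the convolution is phrased this way, each inference step is a single linear layer over the current window buffer, which is the operator class the paper identifies as cheapest on its target hardware.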