67 research outputs found

    ํšจ์œจ์ ์ธ ํ‚ค์›Œ๋“œ ์ธ์‹์„ ์œ„ํ•œ ๊ฐ„๋žต ์ฝ˜๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์„ฑ์›์šฉ.ํ‚ค์›Œ๋“œ ์ŠคํŒŸํŒ…(KWS)์€ ํ˜„์žฌ์˜ ์Œ์„ฑ ๊ธฐ๋ฐ˜ ํœด๋จผ-์ปดํ“จํ„ฐ ์ƒํ˜ธ์ž‘์šฉ์—์„œ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•˜๋ฉฐ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋‹ค. ์‹ ๊ฒฝ๋ง์˜ ๊ธ‰์†ํ•œ ๋ฐœ๋‹ฌ๋กœ ์Œ์„ฑ์ธ์‹, ์Œ์„ฑ ํ•ฉ์„ฑ, ํ™”์ž์ธ์‹ ๋“ฑ ์—ฌ๋Ÿฌ ์Œ์„ฑ ์ฒ˜๋ฆฌ ๋ถ„์•ผ์— ๊ฑธ์นœ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ํฐ ์„ฑ๊ณผ๋ฅผ ๊ฑฐ๋’€๋‹ค. ๋‹ค์–‘ํ•œ ์Œ์„ฑ ์ฒ˜๋ฆฌ ๋ถ„์•ผ์—์„œ ๊ฐ•์ ์„ ๋ณด์ด๊ณ  ์žˆ๋Š” ์ธ๊ณต ์‹ ๊ฒฝ๋ง์€ KWS๋ฅผ ์œ„ํ•œ ์‹œ์Šคํ…œ์—๋„ ๋งค๋ ฅ์ ์ธ ์„ ํƒ์ด ๋˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ™˜๊ฒฝ์€ ์Šค๋งˆํŠธํฐ, ํŒจ๋“œ ๋ฐ ์ผ๋ถ€ ์Šค๋งˆํŠธ ํ™ˆ ๊ธฐ๊ธฐ๋ฅผ ํฌํ•จํ•œ ์†Œํ˜• ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ๋“ค์ด ๋Œ€๋ถ€๋ถ„์ด๊ธฐ ๋•Œ๋ฌธ์—, ์‹ ๊ฒฝ ๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜๋“ค์€ KWS ์‹œ์Šคํ…œ์„ ์„ค๊ณ„ํ•  ๋•Œ ์ด๋Ÿฌํ•œ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์˜ ์ œํ•œ๋œ ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ„์‚ฐ ์šฉ๋Ÿ‰์„ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค. ๋™์‹œ์— ์‹ค์‹œ๊ฐ„, ์‚ฌ์šฉ์ž ์นœํ™”์ , ๋†’์€ ์ •ํ™•๋„๋กœ ๋Œ€์‘ํ•˜๋ ค๋ฉด ๋‚ฎ์€ ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๋˜ํ•œ KWS๋Š” ๋‹ค๋ฅธ ์—…๋ฌด์™€ ๋‹ฌ๋ผ ์ƒ์‹œ ์˜จ๋ผ์ธ ์ƒํƒœ์—์„œ ์ด์šฉ์ž์˜ ํ˜ธ์ถœ์„ ๊ธฐ๋‹ค๋ ค์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— KWS ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ์ „๋ ฅ ์˜ˆ์‚ฐ๋„ ํฌ๊ฒŒ ์ œํ•œ๋œ๋‹ค. ๋ฉ”์ธ์ŠคํŠธ๋ฆผ ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ ์ค‘์—๋Š” ๊ณผ๊ฑฐ DNN, CNN, RNN, ๊ทธ๋ฆฌ๊ณ  ์„œ๋กœ์˜ ์กฐํ•ฉ์ด ์ฃผ๋กœ KWS์— ์‚ฌ์šฉ๋˜๋ฉด์„œ ์ตœ๊ทผ์—๋Š” Attention ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋„ ์ ์  ์ธ๊ธฐ๋ฅผ ๋Œ๊ณ  ์žˆ๋‹ค. ๊ทธ ์ค‘์—์„œ๋„ CNN์€ ์ •ํ™•์„ฑ๊ณผ ๊ฒฌ๊ณ ์„ฑ, ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๊ฐ€ ๋›ฐ์–ด๋‚˜ KWS์—์„œ ๋„๋ฆฌ ์ฑ„ํƒ๋˜๊ณ  ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ํšจ์œจ์ ์ธ ํ‚ค์›Œ๋“œ ์ŠคํŒŸํŒ…์„ ์ง€์›ํ•˜๋Š” ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์ธ ์‹ ํ”Œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์ œ์‹œํ•œ๋‹ค. ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•œ ์ค‘๊ฐ„ ๊ณผ์ •์œผ๋กœ ๋ณด๋‹ค ์ปดํŒฉํŠธํ•œ residual ๋„คํŠธ์›Œํฌ์™€ ๋…ธ์ด์ฆˆ ์ธ์‹ ํ›ˆ๋ จ๋ฒ•์„ ์ฃผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค. ResNet์€ ์ข‹์€ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด ํ•ญ์ƒ ์ˆ˜์‹ญ๋งŒ ๊ฐœ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ํ•„์š”๋กœ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์šฐ๋ฆฌ ๋ชจ๋ธ์—์„œ๋Š” ํ•œ์ •๋œ ์ž์›์„ ๊ฐ€์ง„ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์— ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ๋„๋ก depthwise ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ์‹ค์ œ ๋ชจ๋ฐ”์ผ ๊ธฐ๊ธฐ์ธ ์‚ผ์„ฑ ๊ฐค๋Ÿญ์‹œ S6 ์—ฃ์ง€์—์„œ ์ œ์•ˆ๋œ ๋ชจ๋ธ์˜ ์‹ค์ œ ์ถ”๋ก  ์‹œ๊ฐ„(์ฆ‰, ์ง€์—ฐ ์‹œ๊ฐ„)์„ ์ธก์ •ํ•˜์˜€๋‹ค. ์˜จ๋ผ์ธ ์ƒ ๊ณต๊ฐœ๋œ Google ์Œ์„ฑ ๋ช…๋ น ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์ด ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” ์ œ์‹œ๋œ ๋ชจ๋ธ์ด ๊ธฐ์กด ๋ชจ๋ธ๋ณด๋‹ค ์•ฝ 1/2 ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜์™€ ๊ณ„์‚ฐ ํšŸ์ˆ˜๋ฅผ ํ›จ์”ฌ ์ ๊ฒŒ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ๊ฑฐ์˜ ๋™์ผํ•œ ์ •ํ™•๋„๋กœ ์†๋„๊ฐ€ 17.5 % ๋น ๋ฅด๋ฉฐ 6.9ms์— ๋„๋‹ฌํ–ˆ๋‹ค. ํ›จ์”ฌ ์ž‘์€ ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋ชจ๋กœ๋„ ๋‹ค๋ฅธ ์ตœ์‹  KWS ๋ชจ๋ธ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” 96.59%์˜ ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๊ณ  ์žˆ๋‹ค.Keyword spotting (KWS) plays an important role in the current speech-based human-computer interaction, and is widely used on smart devices. With the rapid development of neural networks, various applications in speech related fields such as speech recognition, speech synthesis and speaker recognition have achieved great performances. Neural networks have become attractive choices for KWS architectures because of their good performance in speech processing. However, since the application environment is mostly in small smart devices including smart phones, tablets and smart home devices, neural network architectures must consider the limited memory and computation capacity of these smart devices when designing a KWS system . 
    At the same time, the KWS system should maintain low latency in order to respond in real time. In addition, KWS differs from other tasks in that it must always be online, waiting for a call from the user, so the power budget of a KWS application is also greatly restricted. Among mainstream neural network models, FCDNNs (fully connected deep neural networks), CNNs (convolutional neural networks), RNNs (recurrent neural networks), and combinations of them were mainly used for KWS in the past; recently, attention-based models have become more and more popular. Among these, the CNN is widely adopted for KWS because of its excellent accuracy, robustness, and parallel processing capacity, the last of which is essential for low-power implementations. In this work, we present a neural network model, the Simple Depthwise Convolutional Network, which supports efficient keyword spotting. We mainly focus on a more compact residual network and apply noise injection as an intermediate process to maintain high accuracy. A ResNet typically requires several hundred thousand parameters to achieve good performance, so in our model we employ depthwise convolutional neural networks to decrease the number of parameters, making the model more suitable for smart devices with limited resources. Finally, our model is tested on a real mobile device, the Samsung Galaxy S6 Edge, where it reaches a real inference time (that is, latency) of about 6.9 ms, 17.5% faster than the state-of-the-art model TC-ResNet. The publicly available Google Speech Commands dataset is used to evaluate the models. The results show that our model uses only about half the parameters and up to 300 times fewer computations than the original base model, with a much smaller memory footprint, while maintaining a comparable high accuracy of 96.59% that outperforms the other state-of-the-art KWS models.

    Table of contents:
    1. Introduction
      1.1 Keyword Spotting System (KWS)
      1.2 Challenges in Keyword Spotting
      1.3 Neural Network Architecture for Small-Footprint KWS
        1.3.1 TDNN-SWSA
        1.3.2 TC-ResNet
        1.3.3 DS-CNN
      1.4 Simple Depthwise Convolutional Neural Network for Efficient KWS
      1.5 Outline of the Thesis
    2. Simple Depthwise Convolutional Neural Network
      2.1 Depthwise ConvNet
      2.2 Simple Depthwise ConvNet
      2.3 Residual Simple Depthwise ConvNet
      2.4 Experiments and Results
    3. Robustness of Efficient Keyword Spotting
      3.1 Weight Noise Injection
      3.2 Experiments on Two Different GSCs
        3.2.1 Standard GSC
        3.2.2 Augmented GSC
        3.2.3 Experiments and Results
      3.3 FRR and FAR in a 3rd Dataset
        3.3.1 FRR and FAR
        3.3.2 The Third GSC
        3.3.3 Experiments and Results
    4. Conclusions
    5. Bibliography
    Abstract (in Korean)
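The parameter savings claimed in the thesis abstract come from swapping standard convolutions for depthwise separable ones. A minimal PyTorch sketch of that substitution follows; the channel counts and kernel size are illustrative assumptions, not the thesis's actual layer configuration:

```python
import torch.nn as nn

# Standard 2D convolution: every output channel mixes all input channels,
# so the kernel tensor holds in_ch * out_ch * k * k weights.
standard = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                     padding=1, bias=False)

# Depthwise separable convolution: a per-channel 3x3 spatial filter
# (groups = in_channels) followed by a 1x1 pointwise mix of channels.
depthwise_separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),
    nn.Conv2d(64, 64, kernel_size=1, bias=False),
)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(count_params(standard))             # 64 * 64 * 3 * 3 = 36864
print(count_params(depthwise_separable))  # 64 * 3 * 3 + 64 * 64 = 4672
```

At these assumed channel counts, the separable form needs roughly 8x fewer weights, which is the effect exploited to fit KWS models into the memory budget of small devices.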
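The abstract also credits noise injection with maintaining accuracy. A minimal sketch of one common form, Gaussian weight noise injection during training, is shown below; the noise scale and exact placement are assumptions for illustration, not the thesis's recipe:

```python
import torch

def train_step_with_weight_noise(model, loss_fn, batch, optimizer, sigma=0.01):
    """One training step with Gaussian weight noise injection.

    Noise is added to every weight before the forward pass and removed
    before the optimizer update, so gradients are computed at the noisy
    point while the stored weights stay clean.
    """
    inputs, targets = batch

    noises = []
    with torch.no_grad():
        for p in model.parameters():
            n = torch.randn_like(p) * sigma
            p.add_(n)                  # perturb weights in place
            noises.append(n)

    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()

    with torch.no_grad():              # restore clean weights, then update
        for p, n in zip(model.parameters(), noises):
            p.sub_(n)
    optimizer.step()
    return loss.item()
```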

    Deep Spoken Keyword Spotting: An Overview

    Get PDF
    Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes, such as the activation of voice assistants. Prospects suggest sustained growth in the social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look to improve KWS performance and reduce computational complexity. This context motivates this paper, in which we conduct a literature review of deep spoken KWS to assist practitioners and researchers interested in this technology. Specifically, this overview is comprehensive in nature, covering a thorough analysis of deep KWS systems (including speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, the performance of deep KWS systems, and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.
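The survey's decomposition of a deep KWS system into speech features, acoustic modeling, and posterior handling can be made concrete with a toy end-to-end sketch; every class, function, and threshold below is a hypothetical illustration of the three stages, not an interface from the paper:

```python
import numpy as np
import torch
import torch.nn as nn

# Stage 1 -- speech features: per-frame log-Mel vectors. Random data stands
# in here for e.g. torchaudio.transforms.MelSpectrogram output.
frames = torch.randn(100, 40)            # (time frames, n_mels)

# Stage 2 -- acoustic model: maps each frame to posteriors over
# {keyword, filler}; real systems use CNNs/RNNs as the survey describes.
class TinyAcousticModel(nn.Module):
    def __init__(self, n_mels=40, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                 nn.Linear(128, n_classes))

    def forward(self, feats):            # (T, n_mels) -> (T, n_classes)
        return self.net(feats).softmax(dim=-1)

# Stage 3 -- posterior handling: smooth the keyword posterior over a
# sliding window and fire a detection when it crosses a threshold.
def detect(posteriors, window=30, threshold=0.8):
    p_kw = posteriors[:, 0].detach().numpy()
    smoothed = np.convolve(p_kw, np.ones(window) / window, mode="same")
    return bool((smoothed > threshold).any())

model = TinyAcousticModel()
print(detect(model(frames)))     # untrained model: almost surely no detection
```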

    Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping

    Full text link
    In this paper, we propose a multilingual query-by-example keyword spotting (KWS) system based on a residual neural network. The model is trained as a classifier on a multilingual keyword dataset extracted from Common Voice sentences and fine-tuned using circle loss. We demonstrate the generalization ability of the model to new languages and report a mean reduction in EER of 59.2% for previously seen and 47.9% for unseen languages compared to a competitive baseline. We show that the word embeddings learned by the KWS model can be accurately predicted from the phoneme sequences using a simple LSTM model. Our system achieves promising accuracy for streaming keyword spotting and keyword search on Common Voice audio using just 5 examples per keyword. Experiments on the Hey-Snips dataset show good performance, with a false negative rate of 5.4% at only 0.1 false alarms per hour.
    Comment: Accepted to ICASSP 202
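The few-shot enrollment described above reduces at inference time to comparing fixed-size embeddings. A minimal sketch of that matching step, assuming some trained `encoder` that maps features to a 1-D embedding vector; the encoder, the averaging-based enrollment, and the threshold are illustrative assumptions, not the paper's exact system:

```python
import torch
import torch.nn.functional as F

def enroll(encoder, example_feats):
    """Build a keyword prototype from a handful of enrollment examples
    (the paper uses just 5) by averaging unit-normalized embeddings."""
    embs = torch.stack([F.normalize(encoder(x), dim=-1) for x in example_feats])
    return F.normalize(embs.mean(dim=0), dim=-1)

def score(encoder, prototype, test_feats, threshold=0.7):
    """Cosine similarity between a test utterance and the prototype;
    a detection fires when it exceeds the (tuned) threshold."""
    emb = F.normalize(encoder(test_feats), dim=-1)
    sim = torch.dot(emb, prototype).item()
    return sim, sim > threshold
```

Circle loss during fine-tuning shapes the embedding space so that this similarity test separates matching from non-matching keywords, and the LSTM phoneme-to-embedding mapping mentioned in the abstract would allow such a prototype to be predicted from a phoneme sequence rather than from audio examples.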
    • โ€ฆ