67 research outputs found

    ํšจ์œจ์ ์ธ ํ‚ค์›Œ๋“œ ์ธ์‹์„ ์œ„ํ•œ ๊ฐ„๋žต ์ฝ˜๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์„ฑ์›์šฉ.ํ‚ค์›Œ๋“œ ์ŠคํŒŸํŒ…(KWS)์€ ํ˜„์žฌ์˜ ์Œ์„ฑ ๊ธฐ๋ฐ˜ ํœด๋จผ-์ปดํ“จํ„ฐ ์ƒํ˜ธ์ž‘์šฉ์—์„œ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•˜๋ฉฐ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋‹ค. ์‹ ๊ฒฝ๋ง์˜ ๊ธ‰์†ํ•œ ๋ฐœ๋‹ฌ๋กœ ์Œ์„ฑ์ธ์‹, ์Œ์„ฑ ํ•ฉ์„ฑ, ํ™”์ž์ธ์‹ ๋“ฑ ์—ฌ๋Ÿฌ ์Œ์„ฑ ์ฒ˜๋ฆฌ ๋ถ„์•ผ์— ๊ฑธ์นœ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ํฐ ์„ฑ๊ณผ๋ฅผ ๊ฑฐ๋’€๋‹ค. ๋‹ค์–‘ํ•œ ์Œ์„ฑ ์ฒ˜๋ฆฌ ๋ถ„์•ผ์—์„œ ๊ฐ•์ ์„ ๋ณด์ด๊ณ  ์žˆ๋Š” ์ธ๊ณต ์‹ ๊ฒฝ๋ง์€ KWS๋ฅผ ์œ„ํ•œ ์‹œ์Šคํ…œ์—๋„ ๋งค๋ ฅ์ ์ธ ์„ ํƒ์ด ๋˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ™˜๊ฒฝ์€ ์Šค๋งˆํŠธํฐ, ํŒจ๋“œ ๋ฐ ์ผ๋ถ€ ์Šค๋งˆํŠธ ํ™ˆ ๊ธฐ๊ธฐ๋ฅผ ํฌํ•จํ•œ ์†Œํ˜• ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ๋“ค์ด ๋Œ€๋ถ€๋ถ„์ด๊ธฐ ๋•Œ๋ฌธ์—, ์‹ ๊ฒฝ ๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜๋“ค์€ KWS ์‹œ์Šคํ…œ์„ ์„ค๊ณ„ํ•  ๋•Œ ์ด๋Ÿฌํ•œ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์˜ ์ œํ•œ๋œ ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ„์‚ฐ ์šฉ๋Ÿ‰์„ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค. ๋™์‹œ์— ์‹ค์‹œ๊ฐ„, ์‚ฌ์šฉ์ž ์นœํ™”์ , ๋†’์€ ์ •ํ™•๋„๋กœ ๋Œ€์‘ํ•˜๋ ค๋ฉด ๋‚ฎ์€ ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๋˜ํ•œ KWS๋Š” ๋‹ค๋ฅธ ์—…๋ฌด์™€ ๋‹ฌ๋ผ ์ƒ์‹œ ์˜จ๋ผ์ธ ์ƒํƒœ์—์„œ ์ด์šฉ์ž์˜ ํ˜ธ์ถœ์„ ๊ธฐ๋‹ค๋ ค์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— KWS ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ์ „๋ ฅ ์˜ˆ์‚ฐ๋„ ํฌ๊ฒŒ ์ œํ•œ๋œ๋‹ค. ๋ฉ”์ธ์ŠคํŠธ๋ฆผ ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ ์ค‘์—๋Š” ๊ณผ๊ฑฐ DNN, CNN, RNN, ๊ทธ๋ฆฌ๊ณ  ์„œ๋กœ์˜ ์กฐํ•ฉ์ด ์ฃผ๋กœ KWS์— ์‚ฌ์šฉ๋˜๋ฉด์„œ ์ตœ๊ทผ์—๋Š” Attention ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋„ ์ ์  ์ธ๊ธฐ๋ฅผ ๋Œ๊ณ  ์žˆ๋‹ค. ๊ทธ ์ค‘์—์„œ๋„ CNN์€ ์ •ํ™•์„ฑ๊ณผ ๊ฒฌ๊ณ ์„ฑ, ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๊ฐ€ ๋›ฐ์–ด๋‚˜ KWS์—์„œ ๋„๋ฆฌ ์ฑ„ํƒ๋˜๊ณ  ์žˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ํšจ์œจ์ ์ธ ํ‚ค์›Œ๋“œ ์ŠคํŒŸํŒ…์„ ์ง€์›ํ•˜๋Š” ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์ธ ์‹ ํ”Œ ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์ œ์‹œํ•œ๋‹ค. ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•œ ์ค‘๊ฐ„ ๊ณผ์ •์œผ๋กœ ๋ณด๋‹ค ์ปดํŒฉํŠธํ•œ residual ๋„คํŠธ์›Œํฌ์™€ ๋…ธ์ด์ฆˆ ์ธ์‹ ํ›ˆ๋ จ๋ฒ•์„ ์ฃผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค. ResNet์€ ์ข‹์€ ์„ฑ๋Šฅ์„ ์–ป๊ธฐ ์œ„ํ•ด ํ•ญ์ƒ ์ˆ˜์‹ญ๋งŒ ๊ฐœ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ํ•„์š”๋กœ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์šฐ๋ฆฌ ๋ชจ๋ธ์—์„œ๋Š” ํ•œ์ •๋œ ์ž์›์„ ๊ฐ€์ง„ ์Šค๋งˆํŠธ ๊ธฐ๊ธฐ์— ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ๋„๋ก depthwise ์ฝ˜๋ณผ๋ฃจ์…˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ์‹ค์ œ ๋ชจ๋ฐ”์ผ ๊ธฐ๊ธฐ์ธ ์‚ผ์„ฑ ๊ฐค๋Ÿญ์‹œ S6 ์—ฃ์ง€์—์„œ ์ œ์•ˆ๋œ ๋ชจ๋ธ์˜ ์‹ค์ œ ์ถ”๋ก  ์‹œ๊ฐ„(์ฆ‰, ์ง€์—ฐ ์‹œ๊ฐ„)์„ ์ธก์ •ํ•˜์˜€๋‹ค. ์˜จ๋ผ์ธ ์ƒ ๊ณต๊ฐœ๋œ Google ์Œ์„ฑ ๋ช…๋ น ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์ด ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” ์ œ์‹œ๋œ ๋ชจ๋ธ์ด ๊ธฐ์กด ๋ชจ๋ธ๋ณด๋‹ค ์•ฝ 1/2 ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜์™€ ๊ณ„์‚ฐ ํšŸ์ˆ˜๋ฅผ ํ›จ์”ฌ ์ ๊ฒŒ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ๊ฑฐ์˜ ๋™์ผํ•œ ์ •ํ™•๋„๋กœ ์†๋„๊ฐ€ 17.5 % ๋น ๋ฅด๋ฉฐ 6.9ms์— ๋„๋‹ฌํ–ˆ๋‹ค. ํ›จ์”ฌ ์ž‘์€ ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋ชจ๋กœ๋„ ๋‹ค๋ฅธ ์ตœ์‹  KWS ๋ชจ๋ธ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” 96.59%์˜ ๋†’์€ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๊ณ  ์žˆ๋‹ค.Keyword spotting (KWS) plays an important role in the current speech-based human-computer interaction, and is widely used on smart devices. With the rapid development of neural networks, various applications in speech related fields such as speech recognition, speech synthesis and speaker recognition have achieved great performances. Neural networks have become attractive choices for KWS architectures because of their good performance in speech processing. However, since the application environment is mostly in small smart devices including smart phones, tablets and smart home devices, neural network architectures must consider the limited memory and computation capacity of these smart devices when designing a KWS system . 
    At the same time, the KWS system should maintain low latency in order to respond in real time. In addition, KWS differs from other tasks in that it must always be online, waiting for a call from the user, so the power budget of a KWS application is also greatly restricted. Among mainstream neural network models, FCDNNs (fully connected deep neural networks), CNNs (convolutional neural networks), RNNs (recurrent neural networks), and combinations of them were mainly used for KWS in the past; recently, attention-based models have become more and more popular. Among these, the CNN is widely adopted for KWS because of its excellent accuracy, robustness, and parallel processing capacity, the last of which is essential for low-power implementations. In this work, we present a neural network model, the Simple Depthwise Convolutional Network, which supports efficient keyword spotting. We mainly focus on a more compact residual network and apply noise injection as an intermediate process to maintain high accuracy. A ResNet typically requires several hundred thousand parameters to achieve good performance, so in our model we employ depthwise convolutional neural networks to decrease the number of parameters, making the model more suitable for smart devices with limited resources. Finally, our model is tested on a real mobile device, the Samsung Galaxy S6 Edge, where it reaches a real inference time (that is, latency) of about 6.9 ms, 17.5% faster than the state-of-the-art model TC-ResNet. The publicly available Google Speech Commands dataset is used to evaluate the models. The results show that our model uses only about half the parameters and up to 300 times fewer computations than the original base model, with a much smaller memory footprint, while maintaining a comparable high accuracy of 96.59% that outperforms the other state-of-the-art KWS models.

    Table of contents:
    1. Introduction
      1.1 Keyword Spotting System (KWS)
      1.2 Challenges in Keyword Spotting
      1.3 Neural Network Architecture for Small-Footprint KWS
        1.3.1 TDNN-SWSA
        1.3.2 TC-ResNet
        1.3.3 DS-CNN
      1.4 Simple Depthwise Convolutional Neural Network for Efficient KWS
      1.5 Outline of the Thesis
    2. Simple Depthwise Convolutional Neural Network
      2.1 Depthwise ConvNet
      2.2 Simple Depthwise ConvNet
      2.3 Residual Simple Depthwise ConvNet
      2.4 Experiments and Results
    3. Robustness of Efficient Keyword Spotting
      3.1 Weight Noise Injection
      3.2 Experiments on Two Different GSCs
        3.2.1 Standard GSC
        3.2.2 Augmented GSC
        3.2.3 Experiments and Results
      3.3 FRR and FAR in a 3rd Dataset
        3.3.1 FRR and FAR
        3.3.2 The Third GSC
        3.3.3 Experiments and Results
    4. Conclusions
    5. Bibliography
    Abstract (in Korean)
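The parameter savings claimed in the thesis abstract come from swapping standard convolutions for depthwise separable ones. A minimal PyTorch sketch of that substitution follows; the channel counts and kernel size are illustrative assumptions, not the thesis's actual layer configuration:

```python
import torch.nn as nn

# Standard 2D convolution: every output channel mixes all input channels,
# so the kernel tensor holds in_ch * out_ch * k * k weights.
standard = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                     padding=1, bias=False)

# Depthwise separable convolution: a per-channel 3x3 spatial filter
# (groups = in_channels) followed by a 1x1 pointwise mix of channels.
depthwise_separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),
    nn.Conv2d(64, 64, kernel_size=1, bias=False),
)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(count_params(standard))             # 64 * 64 * 3 * 3 = 36864
print(count_params(depthwise_separable))  # 64 * 3 * 3 + 64 * 64 = 4672
```

At these assumed channel counts, the separable form needs roughly 8x fewer weights, which is the effect exploited to fit KWS models into the memory budget of small devices.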
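The abstract also credits noise injection with maintaining accuracy. A minimal sketch of one common form, Gaussian weight noise injection during training, is shown below; the noise scale and exact placement are assumptions for illustration, not the thesis's recipe:

```python
import torch

def train_step_with_weight_noise(model, loss_fn, batch, optimizer, sigma=0.01):
    """One training step with Gaussian weight noise injection.

    Noise is added to every weight before the forward pass and removed
    before the optimizer update, so gradients are computed at the noisy
    point while the stored weights stay clean.
    """
    inputs, targets = batch

    noises = []
    with torch.no_grad():
        for p in model.parameters():
            n = torch.randn_like(p) * sigma
            p.add_(n)                  # perturb weights in place
            noises.append(n)

    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()

    with torch.no_grad():              # restore clean weights, then update
        for p, n in zip(model.parameters(), noises):
            p.sub_(n)
    optimizer.step()
    return loss.item()
```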

    Deep Spoken Keyword Spotting: An Overview

    Get PDF
    Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes, such as the activation of voice assistants. Prospects suggest sustained growth in the social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look to improve KWS performance and reduce computational complexity. This context motivates this paper, in which we conduct a literature review of deep spoken KWS to assist practitioners and researchers interested in this technology. Specifically, this overview is comprehensive in nature, covering a thorough analysis of deep KWS systems (including speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, the performance of deep KWS systems, and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.
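The survey's decomposition of a deep KWS system into speech features, acoustic modeling, and posterior handling can be made concrete with a toy end-to-end sketch; every class, function, and threshold below is a hypothetical illustration of the three stages, not an interface from the paper:

```python
import numpy as np
import torch
import torch.nn as nn

# Stage 1 -- speech features: per-frame log-Mel vectors. Random data stands
# in here for e.g. torchaudio.transforms.MelSpectrogram output.
frames = torch.randn(100, 40)            # (time frames, n_mels)

# Stage 2 -- acoustic model: maps each frame to posteriors over
# {keyword, filler}; real systems use CNNs/RNNs as the survey describes.
class TinyAcousticModel(nn.Module):
    def __init__(self, n_mels=40, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                 nn.Linear(128, n_classes))

    def forward(self, feats):            # (T, n_mels) -> (T, n_classes)
        return self.net(feats).softmax(dim=-1)

# Stage 3 -- posterior handling: smooth the keyword posterior over a
# sliding window and fire a detection when it crosses a threshold.
def detect(posteriors, window=30, threshold=0.8):
    p_kw = posteriors[:, 0].detach().numpy()
    smoothed = np.convolve(p_kw, np.ones(window) / window, mode="same")
    return bool((smoothed > threshold).any())

model = TinyAcousticModel()
print(detect(model(frames)))     # untrained model: almost surely no detection
```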

    Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping

    Full text link
    In this paper, we propose a multilingual query-by-example keyword spotting (KWS) system based on a residual neural network. The model is trained as a classifier on a multilingual keyword dataset extracted from Common Voice sentences and fine-tuned using circle loss. We demonstrate the generalization ability of the model to new languages and report a mean reduction in EER of 59.2% for previously seen and 47.9% for unseen languages compared to a competitive baseline. We show that the word embeddings learned by the KWS model can be accurately predicted from the phoneme sequences using a simple LSTM model. Our system achieves promising accuracy for streaming keyword spotting and keyword search on Common Voice audio using just 5 examples per keyword. Experiments on the Hey-Snips dataset show good performance, with a false negative rate of 5.4% at only 0.1 false alarms per hour.
    Comment: Accepted to ICASSP 202
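The few-shot enrollment described above reduces at inference time to comparing fixed-size embeddings. A minimal sketch of that matching step, assuming some trained `encoder` that maps features to a 1-D embedding vector; the encoder, the averaging-based enrollment, and the threshold are illustrative assumptions, not the paper's exact system:

```python
import torch
import torch.nn.functional as F

def enroll(encoder, example_feats):
    """Build a keyword prototype from a handful of enrollment examples
    (the paper uses just 5) by averaging unit-normalized embeddings."""
    embs = torch.stack([F.normalize(encoder(x), dim=-1) for x in example_feats])
    return F.normalize(embs.mean(dim=0), dim=-1)

def score(encoder, prototype, test_feats, threshold=0.7):
    """Cosine similarity between a test utterance and the prototype;
    a detection fires when it exceeds the (tuned) threshold."""
    emb = F.normalize(encoder(test_feats), dim=-1)
    sim = torch.dot(emb, prototype).item()
    return sim, sim > threshold
```

Circle loss during fine-tuning shapes the embedding space so that this similarity test separates matching from non-matching keywords, and the LSTM phoneme-to-embedding mapping mentioned in the abstract would allow such a prototype to be predicted from a phoneme sequence rather than from audio examples.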
    • โ€ฆ