Search CORE

24 research outputs found

Deep Spoken Keyword Spotting: An Overview

Author: Espejo Ivan Lopez
Hansen John
Jensen Jesper
Tan Zheng-Hua
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 20/11/2021

Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices for purposes such as the activation of voice assistants. Prospects suggest sustained growth in the social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review of deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview is comprehensive, covering a thorough analysis of deep KWS systems (including speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.

arXiv.org e-Print Archive

VBN

Honkling: In-Browser Personalization for Ubiquitous Keyword Spotting

Author: Lee Jaejun
Publication venue: 'University of Waterloo'
Publication date: 29/11/2019

Used for simple voice commands and wake-word detection, keyword spotting (KWS) is the task of detecting pre-determined keywords in a stream of utterances. A common implementation of KWS involves transmitting audio samples over the network and detecting target keywords in the cloud with neural networks, because on-device application development presents compatibility issues with various edge devices and provides limited support for deep learning. Unfortunately, such an architecture can lead to unpleasant user experiences because network latency is not deterministic. Furthermore, the client-server architecture raises privacy concerns because users lose control over the audio data once it leaves the edge device. In this thesis, I present Honkling, a novel, JavaScript-based KWS system. Unlike previous KWS systems, Honkling operates purely on the client side: it is decentralized and serverless. Given that it is implemented in JavaScript, Honkling can be deployed directly in the browser, achieving higher compatibility and efficiency than the existing client-server architecture. A comprehensive efficiency evaluation on desktops, laptops, and mobile devices finds that in-browser keyword detection takes only 0.5 seconds and achieves a high accuracy of 94% on the Google Speech Commands dataset. An empirical study finds that the accuracy of Honkling is inconsistent in practice due to different accents. To ensure high detection accuracy for every user, I explore fine-tuning the trained model with user-personalized recordings. My experiments show that such a process can increase absolute accuracy by up to 10% with only five recordings per keyword. Furthermore, the study shows that in-browser fine-tuning takes only eight seconds in the presence of hardware acceleration.

University of Waterloo's Institutional Repository

Neural Network Context Model for Text Detection and Recognition in Online Images

Author: 강철무
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2017

Doctoral thesis, Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2017. 유석인.

We address the problem of detecting and recognizing the text embedded in online images that are circulated over the Web. Our idea is to leverage context information for both text detection and recognition. For detection, we use the local image context around the text region, based on the observation that text often appears sequentially in online images. For recognition, we exploit the metadata associated with the input online image, including tags, comments, and title, which are used as a topic prior for the word candidates in the image. To combine these two sets of context information, we propose a contextual text spotting network (CTSN). We perform a comparative evaluation with five state-of-the-art text spotting methods on newly collected Instagram and Flickr datasets. We show that our approach, which benefits from context information, is more successful for text spotting in online images.

Contents:
Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Preliminary of Neural Networks (3.1 Basic of Neural Network; 3.2 Convolutional Neural Network; 3.3 Pooling Layer; 3.4 Activation Function; 3.5 Recurrent Neural Network; 3.6 Back-Propagation Through Time; 3.7 Bidirectional Recurrent Neural Networks; 3.8 Long-Short Term Memory; 3.9 Optimization; 3.10 Training Loss; 3.11 Training Process)
Chapter 4 Approach for Contextual Text Spotting (4.1 Overview of the Proposed Framework; 4.2 Context-Aware Text Detection: 4.2.1 Text Proposals, 4.2.2 Text Detection Network, 4.2.3 Extraction of Textline Boxes, 4.2.4 Text Detection Network Variants; 4.3 Context-Aware Word Recognition: 4.3.1 Bias Networks for Context, 4.3.2 Recurrent Word Recognition Network, 4.3.3 Recognition Network Variant)
Chapter 5 Experiments (5.1 Dataset; 5.2 Experimental Setup; 5.3 Training; 5.4 Hyperparameters of Contextual Model; 5.5 Neural Network Architecture Variants; 5.6 Results)
Chapter 6 Conclusion
Abstract (in Korean)

SNU Open Repository and Archive

Advances in Image Processing, Analysis and Recognition Technology

Publication venue: 'MDPI AG'
Publication date: 21/06/2022

For many decades, researchers have been trying to make computers' analysis of images as effective as the human visual system. For this purpose, many algorithms and systems have been created. The whole process covers various stages, including image processing, representation and recognition. The results of this work can be applied to many computer-assisted areas of everyday life. They improve particular activities and provide handy tools, which are sometimes only for entertainment but quite often significantly increase our safety. In fact, the range of practical applications of image processing algorithms is particularly wide. Moreover, the rapid growth of computational complexity and computer efficiency has allowed for the development of more sophisticated and effective algorithms and tools. Although significant progress has been made so far, many issues remain, resulting in the need for the development of novel approaches.

Directory of Open Access Books (DOAB)

On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

Author: Antonisse Joey
Azzopardi George
Bennabhaktula Swaroop
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 31/10/2021

Deployed image classification pipelines are typically dependent on the images captured in real-world environments. This means that images might be affected by different sources of perturbations (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, and it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Contributions and applications around low resource deep learning modeling

Author: Vallés Pérez Iván
Publication date: 01/01/2023

Deep learning is the state of the art for several machine learning tasks. Many of these tasks require a large amount of computational resources, which limits their adoption in embedded devices. The main goal of this dissertation is to study methods and algorithms that make it possible to approach problems using deep learning with restricted computational resources. This work also aims to present applications of deep learning in industry. The first contribution is a new activation function for deep learning networks: the modulus function. The experiments show that the proposed activation function achieves superior results in computer vision tasks when compared with the alternatives found in the literature. The second contribution is a new strategy to combine pre-trained models using knowledge distillation. The results of this chapter show that it is possible to significantly increase the accuracy of the smallest pre-trained models, allowing high performance at a lower computational cost. The next contribution of this thesis tackles the problem of sales forecasting in the field of logistics. Two end-to-end systems with two different deep learning techniques (sequence-to-sequence models and transformers) are proposed. The results of this chapter conclude that it is possible to build end-to-end systems to predict the sales of multiple individual products, at multiple points of sale and at different times, with a single machine learning model. The proposed model outperforms the alternatives found in the literature. Finally, the last two contributions belong to the speech technology field. The former studies how to build a Keyword Spotting speech recognition system using an efficient version of a convolutional neural network. In this study, the proposed system is able to beat the performance of all the benchmarks found in the literature when tested against the most complex subtasks. The latter study proposes a standalone state-of-the-art text-to-speech model capable of synthesizing intelligible voice in thousands of voice profiles, while generating speech with meaningful and expressive prosody variations. The proposed approach removes the dependency of previous models on an additional voice system, which makes the proposed system more efficient at training and inference time, and enables offline and on-device operation.

Repositori d'Objectes Digitals per a l'Ensenyament la Recerca i la Cultura

Artificial Intelligence for Multimedia Signal Processing

Publication venue: 'MDPI AG'
Publication date: 16/09/2022

Artificial intelligence technologies are also actively applied to broadcasting and multimedia processing technologies. A lot of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and in the past two to three years attempts have been made to improve image, video, speech, and other data compression efficiency in areas related to MPEG media processing technology. Additionally, technologies such as media creation, processing, editing, and scenario creation are very important areas of research in multimedia processing and engineering. This book contains a collection of topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing, such as computer vision, speech/sound/text processing, and content analysis/information mining.

Directory of Open Access Books (DOAB)

Innovative intelligent sensors to objectively understand exercise interventions for older adults

Author: Jianjia Ma (1260270)
Publication date: 01/01/2019

The population of most western countries is ageing and, therefore, the ageing issue now matters more than ever. According to the reports of the United Nations in 2017, there were a total of 15.8 million (26.9%) people over 60 years of age in the United Kingdom, and the numbers are projected to reach 23.5 million (31.5%) by 2050. Spending on medical treatment and healthcare for older adults accounts for two-fifths of the UK National Health Service (NHS) budget. Keeping older people healthy is a challenge. In general, exercise is believed to benefit both mental and physical health. Specifically, many studies have shown that resistance band exercises have potentially positive effects on both mental and physical health. However, treatment using resistance band exercise is usually done in unmonitored environments, such as at home or in a rehabilitation centre; therefore, the exercise cannot be measured and/or quantified accurately. Despite many years of research, the true effectiveness of resistance band exercises remains unclear. [Continues.]

Loughborough University Institutional Repository