71 research outputs found

    Improving Voice Trigger Detection with Metric Learning

    Full text link
    Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance. The personalized embedding allows adapting to target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in a false rejection rate (FRR) compared to a baseline speaker independent voice trigger model.Comment: Submitted to InterSpeech 202

    Deep Spoken Keyword Spotting:An Overview

    Get PDF
    Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS

    Energy-Efficient Recurrent Neural Network Accelerators for Real-Time Inference

    Full text link
    Over the past decade, Deep Learning (DL) and Deep Neural Network (DNN) have gone through a rapid development. They are now vastly applied to various applications and have profoundly changed the life of hu- man beings. As an essential element of DNN, Recurrent Neural Networks (RNN) are helpful in processing time-sequential data and are widely used in applications such as speech recognition and machine translation. RNNs are difficult to compute because of their massive arithmetic operations and large memory footprint. RNN inference workloads used to be executed on conventional general-purpose processors including Central Processing Units (CPU) and Graphics Processing Units (GPU); however, they have un- necessary hardware blocks for RNN computation such as branch predictor, caching system, making them not optimal for RNN processing. To accelerate RNN computations and outperform the performance of conventional processors, previous work focused on optimization methods on both software and hardware. On the software side, previous works mainly used model compression to reduce the memory footprint and the arithmetic operations of RNNs. On the hardware side, previous works also designed domain-specific hardware accelerators based on Field Pro- grammable Gate Arrays (FPGA) or Application Specific Integrated Circuits (ASIC) with customized hardware pipelines optimized for efficient pro- cessing of RNNs. By following this software-hardware co-design strategy, previous works achieved at least 10X speedup over conventional processors. Many previous works focused on achieving high throughput with a large batch of input streams. However, in real-time applications, such as gaming Artificial Intellegence (AI), dynamical system control, low latency is more critical. Moreover, there is a trend of offloading neural network workloads to edge devices to provide a better user experience and privacy protection. Edge devices, such as mobile phones and wearable devices, are usually resource-constrained with a tight power budget. They require RNN hard- ware that is more energy-efficient to realize both low-latency inference and long battery life. Brain neurons have sparsity in both the spatial domain and time domain. Inspired by this human nature, previous work mainly explored model compression to induce spatial sparsity in RNNs. The delta network algorithm alternatively induces temporal sparsity in RNNs and can save over 10X arithmetic operations in RNNs proven by previous works. In this work, we have proposed customized hardware accelerators to exploit temporal sparsity in Gated Recurrent Unit (GRU)-RNNs and Long Short-Term Memory (LSTM)-RNNs to achieve energy-efficient real-time RNN inference. First, we have proposed DeltaRNN, the first-ever RNN accelerator to exploit temporal sparsity in GRU-RNNs. DeltaRNN has achieved 1.2 TOp/s effective throughput with a batch size of 1, which is 15X higher than its related works. Second, we have designed EdgeDRNN to accelerate GRU-RNN edge inference. Compared to DeltaRNN, EdgeDRNN does not rely on on-chip memory to store RNN weights and focuses on reducing off-chip Dynamic Random Access Memory (DRAM) data traffic using a more scalable architecture. EdgeDRNN have realized real-time inference of large GRU-RNNs with submillisecond latency and only 2.3 W wall plug power consumption, achieving 4X higher energy efficiency than commercial edge AI platforms like NVIDIA Jetson Nano. Third, we have used DeltaRNN to realize the first-ever continuous speech recognition sys- tem with the Dynamic Audio Sensor (DAS) as the front-end. The DAS is a neuromorphic event-driven sensor that produces a stream of asyn- chronous events instead of audio data sampled at a fixed sample rate. We have also showcased how an RNN accelerator can be integrated with an event-driven sensor on the same chip to realize ultra-low-power Keyword Spotting (KWS) on the extreme edge. Fourth, we have used EdgeDRNN to control a powered robotic prosthesis using an RNN controller to replace a conventional proportional–derivative (PD) controller. EdgeDRNN has achieved 21 μs latency of running the RNN controller and could maintain stable control of the prosthesis. We have used DeltaRNN and EdgeDRNN to solve these problems to prove their value in solving real-world problems. Finally, we have applied the delta network algorithm on LSTM-RNNs and have combined it with a customized structured pruning method, called Column-Balanced Targeted Dropout (CBTD), to induce spatio-temporal sparsity in LSTM-RNNs. Then, we have proposed another FPGA-based accelerator called Spartus, the first RNN accelerator that exploits spatio- temporal sparsity. Spartus achieved 9.4 TOp/s effective throughput with a batch size of 1, the highest among present FPGA-based RNN accelerators with a power budget around 10 W. Spartus can complete the inference of an LSTM layer having 5 million parameters within 1 μs

    Filler Word Detection and Classification: A Dataset and Benchmark

    Full text link
    Filler words such as `uh' or `um' are sounds or words people use to signal they are pausing to think. Finding and removing filler words from recordings is a common and tedious task in media editing. Automatically detecting and classifying filler words could greatly aid in this task, but few studies have been published on this problem. A key reason is the absence of a dataset with annotated filler words for training and evaluation. In this work, we present a novel speech dataset, PodcastFillers, with 35K annotated filler words and 50K annotations of other sounds that commonly occur in podcasts such as breaths, laughter, and word repetitions. We propose a pipeline that leverages VAD and ASR to detect filler candidates and a classifier to distinguish between filler word types. We evaluate our proposed pipeline on PodcastFillers, compare to several baselines, and present a detailed ablation study. In particular, we evaluate the importance of using ASR and how it compares to a transcription-free approach resembling keyword spotting. We show that our pipeline obtains state-of-the-art results, and that leveraging ASR strongly outperforms a keyword spotting approach. We make PodcastFillers publicly available, and hope our work serves as a benchmark for future research.Comment: Submitted to Insterspeech 202

    LOW RESOURCE HIGH ACCURACY KEYWORD SPOTTING

    Get PDF
    Keyword spotting (KWS) is a task to automatically detect keywords of interest in continuous speech, which has been an active research topic for over 40 years. Recently there is a rising demand for KWS techniques in resource constrained conditions. For example, as for the year of 2016, USC Shoah Foundation covers audio-visual testimonies from survivors and other witnesses of the Holocaust in 63 countries and 39 languages, and providing search capability for those testimonies requires substantial KWS technologies in low language resource conditions, as for most languages, resources for developing KWS systems are not as rich as that for English. Despite the fact that KWS has been in the literature for a long time, KWS techniques in resource constrained conditions have not been researched extensively. In this dissertation, we improve KWS performance in two low resource conditions: low language resource condition where language specific data is inadequate, and low computation resource condition where KWS runs on computation constrained devices. For low language resource KWS, we focus on applications for speech data mining, where large vocabulary continuous speech recognition (LVCSR)-based KWS techniques are widely used. Keyword spotting for those applications are also known as keyword search (KWS) or spoken term detection (STD). A key issue for this type of KWS technique is the out-of-vocabulary (OOV) keyword problem. LVCSR-based KWS can only search for words that are defined in the LVCSR's lexicon, which is typically very small in a low language resource condition. To alleviate the OOV keyword problem, we propose a technique named "proxy keyword search" that enables us to search for OOV keywords with regular LVCSR-based KWS systems. We also develop a technique that expands LVCSR's lexicon automatically by adding hallucinated words, which increases keyword coverage and therefore improves KWS performance. Finally we explore the possibility of building LVCSR-based KWS systems with limited lexicon, or even without an expert pronunciation lexicon. For low computation resource KWS, we focus on wake-word applications, which usually run on computation constrained devices such as mobile phones or tablets. We first develop a deep neural network (DNN)-based keyword spotter, which is lightweight and accurate enough that we are able to run it on devices continuously. This keyword spotter typically requires a pre-defined keyword, such as "Okay Google". We then propose a long short-term memory (LSTM)-based feature extractor for query-by-example KWS, which enables the users to define their own keywords

    Vega: A Ten-Core SoC for IoT Endnodes with DNN Acceleration and Cognitive Wake-Up from MRAM-Based State-Retentive Sleep Mode

    Get PDF
    The Internet-of-Things (IoT) requires endnodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT endnode system on chip (SoC) capable of scaling from a 1.7- μW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance on NSAAs, including mobile deep neural network (DNN) inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile magnetoresistive random access memory (MRAM). To meet the performance and flexibility requirements of NSAAs, the SoC features ten RISC-V cores: one core for SoC and IO management and a nine-core cluster supporting multi-precision single instruction multiple data (SIMD) integer and floating-point (FP) computation. Vega achieves the state-of-the-art (SoA)-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On FP computation, it achieves the SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine learning (ML) accelerators boost energy efficiency in cognitive sleep and active states

    Spoken command recognition for robotics

    Get PDF
    In this thesis, I investigate spoken command recognition technology for robotics. While high robustness is expected, the distant and noisy conditions in which the system has to operate make the task very challenging. Unlike commercial systems which all rely on a "wake-up" word to initiate the interaction, the pipeline proposed here directly detect and recognizes commands from the continuous audio stream. In order to keep the task manageable despite low-resource conditions, I propose to focus on a limited set of commands, thus trading off flexibility of the system against robustness. Domain and speaker adaptation strategies based on a multi-task regularization paradigm are first explored. More precisely, two different methods are proposed which rely on a tied loss function which penalizes the distance between the output of several networks. The first method considers each speaker or domain as a task. A canonical task-independent network is jointly trained with task-dependent models, allowing both types of networks to improve by learning from one another. While an improvement of 3.2% on the frame error rate (FER) of the task-independent network is obtained, this only partially carried over to the phone error rate (PER), with 1.5% of improvement. Similarly, a second method explored the parallel training of the canonical network with a privileged model having access to i-vectors. This method proved less effective with only 1.2% of improvement on the FER. In order to make the developed technology more accessible, I also investigated the use of a sequence-to-sequence (S2S) architecture for command classification. The use of an attention-based encoder-decoder model reduced the classification error by 40% relative to a strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing the relevance of S2S architectures in such context. In order to improve the flexibility of the trained system, I also explored strategies for few-shot learning, which allow to extend the set of commands with minimum requirements in terms of data. Retraining a model on the combination of original and new commands, I managed to achieve 40.5% of accuracy on the new commands with only 10 examples for each of them. This scores goes up to 81.5% of accuracy with a larger set of 100 examples per new command. An alternative strategy, based on model adaptation achieved even better scores, with 68.8% and 88.4% of accuracy with 10 and 100 examples respectively, while being faster to train. This high performance is obtained at the expense of the original categories though, on which the accuracy deteriorated. Those results are very promising as the methods allow to easily extend an existing S2S model with minimal resources. Finally, a full spoken command recognition system (named iCubrec) has been developed for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to propose a fully hand-free experience. By segmenting only regions that are likely to contain commands, the VAD module also allows to reduce greatly the computational cost of the pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM command recognition system for transcription. The VoCub dataset has been specifically gathered to train a DNN-based acoustic model for our task. Through multi-condition training with the CHiME4 dataset, an accuracy of 94.5% is reached on VoCub test set. A filler model, complemented by a rejection mechanism based on a confidence score, is finally added to the system to reject non-command speech in a live demonstration of the system

    Spike encoding techniques for IoT time-varying signals benchmarked on a neuromorphic classification task

    Get PDF
    Spiking Neural Networks (SNNs), known for their potential to enable low energy consumption and computational cost, can bring significant advantages to the realm of embedded machine learning for edge applications. However, input coming from standard digital sensors must be encoded into spike trains before it can be elaborated with neuromorphic computing technologies. We present here a detailed comparison of available spike encoding techniques for the translation of time-varying signals into the event-based signal domain, tested on two different datasets both acquired through commercially available digital devices: the Free Spoken Digit dataset (FSD), consisting of 8-kHz audio files, and the WISDM dataset, composed of 20-Hz recordings of human activity through mobile and wearable inertial sensors. We propose a complete pipeline to benchmark these encoding techniques by performing time-dependent signal classification through a Spiking Convolutional Neural Network (sCNN), including a signal preprocessing step consisting of a bank of filters inspired by the human cochlea, feature extraction by production of a sonogram, transfer learning via an equivalent ANN, and model compression schemes aimed at resource optimization. The resulting performance comparison and analysis provides a powerful practical tool, empowering developers to select the most suitable coding method based on the type of data and the desired processing algorithms, and further expands the applicability of neuromorphic computational paradigms to embedded sensor systems widely employed in the IoT and industrial domains

    Deep Neural Network Architectures for Large-scale, Robust and Small-Footprint Speaker and Language Recognition

    Full text link
    Tesis doctoral inédita leída en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Fecha de lectura : 27-04-2017Artificial neural networks are powerful learners of the information embedded in speech signals. They can provide compact, multi-level, nonlinear representations of temporal sequences and holistic optimization algorithms capable of surpassing former leading paradigms. Artificial neural networks are, therefore, a promising technology that can be used to enhance our ability to recognize speakers and languages–an ability increasingly in demand in the context of new, voice-enabled interfaces used today by millions of users. The aim of this thesis is to advance the state-of-the-art of language and speaker recognition through the formulation, implementation and empirical analysis of novel approaches for large-scale and portable speech interfaces. Its major contributions are: (1) novel, compact network architectures for language and speaker recognition, including a variety of network topologies based on fully-connected, recurrent, convolutional, and locally connected layers; (2) a bottleneck combination strategy for classical and neural network approaches for long speech sequences; (3) the architectural design of the first, public, multilingual, large vocabulary continuous speech recognition system; and (4) a novel, end-to-end optimization algorithm for text-dependent speaker recognition that is applicable to a range of verification tasks. Experimental results have demonstrated that artificial neural networks can substantially reduce the number of model parameters and surpass the performance of previous approaches to language and speaker recognition, particularly in the cases of long short-term memory recurrent networks (used to model the input speech signal), end-to-end optimization algorithms (used to predict languages or speakers), short testing utterances, and large training data collections.Las redes neuronales artificiales son sistemas de aprendizaje capaces de extraer la información embebida en las señales de voz. Son capaces de modelar de forma eficiente secuencias temporales complejas, con información no lineal y distribuida en distintos niveles semanticos, mediante el uso de algoritmos de optimización integral con la capacidad potencial de mejorar los sistemas aprendizaje automático existentes. Las redes neuronales artificiales son, pues, una tecnología prometedora para mejorar el reconocimiento automático de locutores e idiomas; siendo el reconocimiento de de locutores e idiomas, tareas con cada vez más demanda en los nuevos sistemas de control por voz, que ya utilizan millones de personas. Esta tesis tiene como objetivo la mejora del estado del arte de las tecnologías de reconocimiento de locutor y de idioma mediante la formulación, implementación y análisis empírico de nuevos enfoques basados en redes neuronales, aplicables a dispositivos portátiles y a su uso en gran escala. Las principales contribuciones de esta tesis incluyen la propuesta original de: (1) arquitecturas eficientes que hacen uso de capas neuronales densas, localmente densas, recurrentes y convolucionales; (2) una nueva estrategia de combinación de enfoques clásicos y enfoques basados en el uso de las denominadas redes de cuello de botella; (3) el diseño del primer sistema público de reconocimiento de voz, de vocabulario abierto y continuo, que es además multilingüe; y (4) la propuesta de un nuevo algoritmo de optimización integral para tareas de reconocimiento de locutor, aplicable también a otras tareas de verificación. Los resultados experimentales extraídos de esta tesis han demostrado que las redes neuronales artificiales son capaces de reducir el número de parámetros usados por los algoritmos de reconocimiento tradicionales, así como de mejorar el rendimiento de dichos sistemas de forma substancial. Dicha mejora relativa puede acentuarse a través del modelado de voz mediante redes recurrentes de memoria a largo plazo, el uso de algoritmos de optimización integral, el uso de locuciones de evaluation de corta duración y mediante la optimización del sistema con grandes cantidades de datos de entrenamiento
    • …
    corecore