10 research outputs found

Lessons from Building Acoustic Models with a Million Hours of Speech

Author: Parthasarathi Sree Hari Krishnan
Strom Nikko
Publication date: 02/04/2019

This is a report of our lessons learned building acoustic models from 1 Million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on unlabeled data, helping scale out target generation in comparison to confidence model based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation, we store high valued logits from the teacher model. Introducing the notion of scheduled learning, we interleave learning on unlabeled and labeled data. To scale distributed training across a large number of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on labeled data with gradient threshold compression SGD using 16 GPUs. Our experiments show that extremely large amounts of data are indeed useful; with little hyper-parameter tuning, we obtain relative WER improvements in the 10 to 20% range, with higher gains in noisier conditions.

Comment: "Copyright 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."
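
The abstract above mentions storing only the high valued teacher logits. A minimal sketch of that idea follows, under assumptions the abstract does not confirm: the number of kept logits `k`, the helper names, and the way soft targets are rebuilt (softmax over the kept logits, zero elsewhere) are all hypothetical choices made for illustration.

```python
import math

def top_k_logits(logits, k):
    """Keep the k largest teacher logits as (class_index, value) pairs.
    The value of k is an assumption; the abstract does not state it."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def soft_targets(stored, num_classes):
    """Rebuild a target distribution from the stored pairs (assumed scheme):
    softmax over the kept logits, probability zero for dropped classes."""
    peak = max(value for _, value in stored)
    exps = [(index, math.exp(value - peak)) for index, value in stored]
    total = sum(e for _, e in exps)
    probs = [0.0] * num_classes
    for index, e in exps:
        probs[index] = e / total
    return probs

def cross_entropy(targets, student_logits):
    """Cross-entropy between soft targets and the student's softmax output."""
    peak = max(student_logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in student_logits))
    return -sum(t * (x - log_z) for t, x in zip(targets, student_logits) if t > 0)

# One frame with four output classes; only two logits are written to storage.
teacher_frame = [2.0, -1.0, 0.5, 3.0]
stored = top_k_logits(teacher_frame, k=2)
targets = soft_targets(stored, num_classes=len(teacher_frame))
loss = cross_entropy(targets, [1.0, 0.0, 0.0, 2.0])
```

Storing `k` pairs per frame instead of the full output vector is what saves space in this sketch; the cost is that the student never sees the teacher's low-probability mass.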

arXiv.org e-Print Archive

Crossref

Anchored Speech Recognition with Neural Transducers

Author: Jia Junteng
Kalinli Ozlem
Mahadeokar Jay
Moritz Niko
Raj Desh
Wu Chunyang
Zhang Xiaohui
Publication date: 12/03/2023

Neural transducers have achieved human level performance on standard speech recognition benchmarks. However, their performance significantly degrades in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio. Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake-words) to recognize device-directed speech while ignoring interfering background speech. In this paper, we investigate anchored speech recognition to make neural transducers robust to background speech. We extract context information from the anchor segment with a tiny auxiliary network, and use encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context embedding extraction, we propose auxiliary training objectives to disentangle lexical content from speaking style. We evaluate our methods on synthetic LibriSpeech-based mixtures comprising several SNR and overlap conditions; they improve relative word error rates by 19.6% over a strong baseline, when averaged over all conditions.

Comment: To appear at IEEE ICASSP 202
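
The evaluation above relies on synthetic mixtures at several SNR conditions. The abstract does not describe how the mixtures were built, so the following is only a generic sketch of mixing two signals at a requested SNR; the function names and the toy sample values are hypothetical.

```python
import math

def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(target, interference, snr_db):
    """Scale the interference so that
    10 * log10(power(target) / power(scaled)) equals snr_db, then add it
    to the target. Both signals are assumed to have the same length."""
    gain = math.sqrt(power(target) / (power(interference) * 10 ** (snr_db / 10)))
    scaled = [gain * x for x in interference]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled

# Toy signals standing in for the primary and the background speaker.
primary = [1.0, -1.0, 1.0, -1.0]
background = [0.5, 0.5, -0.5, -0.5]
mixture, scaled_background = mix_at_snr(primary, background, snr_db=10.0)
achieved_db = 10 * math.log10(power(primary) / power(scaled_background))
```

Varying `snr_db`, and in a fuller version the time offset between the two signals, would give a grid of SNR and overlap conditions like the one the abstract mentions.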

arXiv.org e-Print Archive

Anchor Word based Deep Attractor Network for Multi-Speaker Separation

Author: Qian Jiayi
Publication venue: Yamana Hayato
Publication date: 22/07/2019

Waseda University Repository

Institutional Repositories DataBase (IRDB)

Development of an anthropomorphic mobile manipulator with human, machine and environment interaction

Author: Garcia Inês
Gonçalves Fernando
Lopes Gil
Monteiro A. Caetano
Ribeiro A. Fernando
Ribeiro Tiago
Publication venue: Centre for Evaluation in Education and Science (CEON/CEES)
Publication date: 01/01/2019

An anthropomorphic mobile manipulator robot (CHARMIE) is being developed by the University of Minho's Automation and Robotics Laboratory (LAR). The robot gathers sensory information and processes it using neural networks, actuating in real time. The robot's two arms allow object and machine interaction. Its anthropomorphic structure is advantageous since machines are designed and optimized for human interaction. Sound output allows it to relay information to workers and provide feedback. Combining these features with communication with a database or remote operator establishes a bridge between the physical environment and the virtual domain. The goal is an increase in information flow and accessibility. This paper presents the current state of the project, its intended features and how it can contribute to the development of Industry 4.0. Focus is given to already finished work, detailing the methodology used for two of the robot's subsystems: the locomotion system and the lower limbs of the robot. This project has been supported by the ALGORITMI Research Centre of University of Minho's School of Engineering.

Universidade do Minho: RepositoriUM

Crossref

WAKE WORD DETECTION AND ITS APPLICATIONS

Author: Wang Yiming
Publication venue: The Busan Gyeongnam Mathematical Society
Publication date: 16/09/2021

Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. Novel methods are proposed to train a wake word detection system from partially labeled training data, and to use it in online applications. In the system, the prerequisite of frame-level alignment is removed, permitting the use of untranscribed training examples that are annotated only for the presence/absence of the wake word. Also, an FST-based decoder is presented to perform online detection. The suite of methods greatly improves the wake word detection performance across several datasets. A novel neural network for acoustic modeling in wake word detection is also investigated. Specifically, the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection is explored, including looking ahead to the next chunk, gradient stopping, different positional embedding methods and adding same-layer dependency between chunks. Experiments demonstrate that the proposed Transformer model significantly outperforms the baseline convolutional network with a comparable model size, while still maintaining linear complexity w.r.t. the input length. For the application of the detected wake word in ASR, the problem of improving speech recognition with the help of the detected wake word is investigated. Voice-controlled household devices face the difficulty of performing speech recognition of device-directed speech in the presence of interfering background speech. Two end-to-end models are proposed to tackle this problem with information extracted from the anchored segment. The anchored segment refers to the wake word segment of the audio stream, which contains valuable speaker information that can be used to suppress interfering speech and background noise. A multi-task learning setup is also explored where the ideal mask, obtained from a data synthesis procedure, is used to guide the model training.
In addition, a way to synthesize "noisy" speech from "clean" speech is also proposed to mitigate the mismatch between training and test data. The proposed methods show large word error reduction for Amazon Alexa live data with interfering background speech, without sacrificing the performance on clean speech.
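
The abstract above refers to chunk-wise streaming Transformers with optional look-ahead to the next chunk and linear complexity in the input length. A rough sketch of a chunk-wise attention mask is given here; the function name, the `left_chunks` parameter and its default are assumptions for illustration and are not taken from the thesis.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks=1, look_ahead=False):
    """Boolean mask where mask[q][k] is True if query frame q may attend to
    key frame k. Each frame sees its own chunk, `left_chunks` earlier chunks
    and, if look_ahead is set, the next chunk. The context sizes here are
    assumed values, not the configuration used in the thesis."""
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        first = q_chunk - left_chunks
        last = q_chunk + (1 if look_ahead else 0)
        mask.append([first <= k // chunk_size <= last for k in range(num_frames)])
    return mask

# Six frames split into chunks of two frames each.
mask = chunk_attention_mask(num_frames=6, chunk_size=2)
visible_per_frame = [sum(row) for row in mask]
```

Because every row allows at most `(left_chunks + 1 + look_ahead) * chunk_size` keys regardless of `num_frames`, the total attention work in this sketch grows linearly with the input length.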

JScholarship