Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework
Speech recognition systems for irregularly-spelled languages like English
normally require hand-written pronunciations. In this paper, we describe a
system for automatically obtaining pronunciations of words for which
pronunciations are not available, but for which transcribed data exists. Our
method integrates information from the letter sequence and from the acoustic
evidence. The novel aspect we address is how to prune entries from such a
lexicon, since, empirically, lexicons with too many entries tend to hurt
ASR performance. Experiments on
various ASR tasks show that, with the proposed framework, starting with an
initial lexicon of several thousand words, we are able to learn a lexicon which
performs close to a full expert lexicon in terms of WER performance on test
data, and is better than lexicons built using G2P alone or with a pruning
criterion based on pronunciation probability.
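The greedy selection idea in this abstract can be illustrated with a minimal sketch. The function, scores, and the margin-based pruning criterion below are hypothetical stand-ins, not the paper's actual objective: candidate pronunciations per word are ranked by some acoustic score, and extra variants are kept only while they stay competitive with the best one.

```python
# Hypothetical sketch of greedy pronunciation selection with pruning.
# The scores and the 0.5-of-best margin are illustrative assumptions,
# standing in for acoustic evidence collected from transcribed data.
def greedy_select(candidates, max_per_word=2):
    """candidates: {word: [(pronunciation, score), ...]}, higher score = better."""
    lexicon = {}
    for word, prons in candidates.items():
        ranked = sorted(prons, key=lambda p: p[1], reverse=True)
        best = ranked[0][1]
        # Keep at most max_per_word entries, and only those whose score
        # is within a margin of the best -- a simple pruning criterion.
        lexicon[word] = [p for p, s in ranked[:max_per_word] if s >= 0.5 * best]
    return lexicon

lex = greedy_select({"tomato": [("t ah m ey t ow", 0.9),
                                ("t ah m aa t ow", 0.7),
                                ("t ow m ah t ow", 0.1)]})
print(lex["tomato"])  # ['t ah m ey t ow', 't ah m aa t ow']
```

The low-scoring third variant is pruned, which mirrors the abstract's observation that lexicons with too many entries hurt ASR performance.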
Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions to obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.
Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
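The bag-of-words baseline mentioned in the abstract can be sketched in a few lines. This is not the paper's CNN-based framework, and the unit labels ("u1", "u2", ...) are hypothetical outputs of an unsupervised tokenizer; the sketch only shows how a spoken document becomes a count vector over discovered units and is classified without any ASR transcript.

```python
from collections import Counter

# Minimal bag-of-words sketch over automatically discovered unit labels.
# Unit names and the two toy topics are illustrative assumptions.
def bow(tokens, vocab):
    """Count vector of a token sequence over a fixed unit vocabulary."""
    counts = Counter(tokens)
    return [counts[u] for u in vocab]

vocab = ["u1", "u2", "u3"]
centroids = {"weather": bow("u1 u1 u2".split(), vocab),
             "sports":  bow("u3 u3 u2".split(), vocab)}

def classify(tokens):
    v = bow(tokens, vocab)
    # Dot product against each class centroid; highest score wins.
    return max(centroids, key=lambda t: sum(a * b for a, b in zip(v, centroids[t])))

print(classify("u1 u2 u1".split()))  # weather
```

A supervised classifier over such vectors is the "standard bag-of-words representation" that the paper's learned document representations are compared against.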
GPU-accelerated Guided Source Separation for Meeting Transcription
Guided source separation (GSS) is a type of target-speaker extraction method
that relies on pre-computed speaker activities and blind source separation to
perform front-end enhancement of overlapped speech signals. It was first
proposed during the CHiME-5 challenge and provided significant improvements
over the delay-and-sum beamforming baseline. Despite its strengths, however,
the method has seen limited adoption for meeting transcription benchmarks
primarily due to its high computation time. In this paper, we describe our
improved implementation of GSS that leverages the power of modern GPU-based
pipelines, including batched processing of frequencies and segments, to provide
300x speed-up over CPU-based inference. The improved inference time allows us
to perform detailed ablation studies over several parameters of the GSS
algorithm, such as context duration, number of channels, and noise class, to
name a few. We provide end-to-end reproducible pipelines for speaker-attributed
transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting. Our
code and recipes are publicly available: https://github.com/desh2608/gss.
Comment: 7 pages, 4 figures.
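The batching idea behind the GPU speed-up can be illustrated abstractly. The sketch below is not from the linked repository; it uses NumPy as a stand-in for a GPU array library, and the toy sizes and mask are assumptions. The point is that per-frequency statistics (here, masked spatial covariance matrices) are computed for all frequency bins in one vectorized call rather than a Python loop over bins.

```python
import numpy as np

# Illustrative sketch: batched per-frequency masked covariance computation.
# NumPy stands in for a GPU array library; shapes and data are toy values.
rng = np.random.default_rng(0)
F, T, C = 4, 10, 2          # frequency bins, time frames, channels
X = rng.standard_normal((F, C, T)) + 1j * rng.standard_normal((F, C, T))
mask = rng.random((F, T))   # per-bin target-speaker activity mask

# All F covariance matrices (F, C, C) in a single einsum, no loop over bins.
cov = (np.einsum('ft,fct,fdt->fcd', mask, X, X.conj())
       / mask.sum(axis=1)[:, None, None])
print(cov.shape)  # (4, 2, 2)
```

On a GPU, the same vectorization over frequencies (and over segments) is what turns a serial per-bin computation into one large batched kernel launch.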
Large-scale random forest language models for speech recognition
The random forest language model (RFLM) has shown encouraging results in
several automatic speech recognition (ASR) tasks but has been hindered by
practical limitations, notably the space-complexity of RFLM estimation from
large amounts of data. This paper addresses large-scale training and testing
of the RFLM via an efficient disk-swapping strategy that exploits the
recursive structure of a binary decision tree and the local access property
of the tree-growing algorithm, redeeming the full potential of the RFLM, and
opening avenues of further research, including useful comparisons with
n-gram models. Benefits of this strategy are demonstrated by perplexity
reduction and lattice rescoring experiments using a state-of-the-art ASR
system.
Index Terms: random forest language model, large-scale training, data
scaling, speech recognition
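The disk-swapping strategy in this abstract rests on a simple property: tree growing touches one subtree at a time, so finished subtrees can live on disk while only a small stub stays in memory. The sketch below is an illustrative toy, not the paper's implementation; the `Node` class, file naming, and pickle-based serialization are all assumptions.

```python
import os
import pickle
import tempfile

# Illustrative sketch of subtree swapping (not the paper's implementation):
# a completed subtree is serialized to disk and replaced in memory by its
# file path, exploiting the local-access property of tree growing.
class Node:
    def __init__(self, data):
        self.data, self.left, self.right = data, None, None

def swap_out(node, directory):
    """Write a subtree to disk; the returned path is the in-memory stub."""
    path = os.path.join(directory, f"subtree_{id(node)}.pkl")
    with open(path, "wb") as f:
        pickle.dump(node, f)
    return path

def swap_in(path):
    """Reload a previously swapped-out subtree when it is needed again."""
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    stub = swap_out(Node([1, 2, 3]), d)
    restored = swap_in(stub)
    print(restored.data)  # [1, 2, 3]
```

Because only the subtree currently being grown must be resident, peak memory is bounded by the largest single subtree rather than the whole forest.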