29,075 research outputs found
Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework
Speech recognition systems for irregularly-spelled languages like English
normally require hand-written pronunciations. In this paper, we describe a
system for automatically obtaining pronunciations of words for which
pronunciations are not available, but for which transcribed data exists. Our
method integrates information from the letter sequence and from the acoustic
evidence. The novel aspect of the problem that we address is the problem of how
to prune entries from such a lexicon (since, empirically, lexicons with too
many entries do not tend to be good for ASR performance). Experiments on
various ASR tasks show that, with the proposed framework, starting with an
initial lexicon of several thousand words, we are able to learn a lexicon which
performs close to a full expert lexicon in terms of WER performance on test
data, and is better than lexicons built using G2P alone or with a pruning
criterion based on pronunciation probability
Phoneme and sentence-level ensembles for speech recognition
We address the question of whether and how boosting and bagging can be used for speech recognition. In order to do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition
Speaker diarization of multi-party conversations using participants role information: political debates and professional meetings
Speaker Diarization aims at inferring who spoke when in an audio stream and involves two simultaneous unsupervised tasks: (1) the estimation of the number of speakers, and (2) the association of speech segments to each speaker. Most of the recent efforts in the domain have addressed the problem using machine learning techniques or statistical methods (for a review see [11]) ignoring the fact that the data consists of instances of human conversations
Subword and Crossword Units for CTC Acoustic Models
This paper proposes a novel approach to create an unit set for CTC based
speech recognition systems. By using Byte Pair Encoding we learn an unit set of
an arbitrary size on a given training text. In contrast to using characters or
words as units this allows us to find a good trade-off between the size of our
unit set and the available training data. We evaluate both Crossword units,
that may span multiple word, and Subword units. By combining this approach with
decoding methods using a separate language model we are able to achieve state
of the art results for grapheme based CTC systems.Comment: Current version accepted at Interspeech 201
No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models
For decades, context-dependent phonemes have been the dominant sub-word unit
for conventional acoustic modeling systems. This status quo has begun to be
challenged recently by end-to-end models which seek to combine acoustic,
pronunciation, and language model components into a single neural network. Such
systems, which typically predict graphemes or words, simplify the recognition
process since they remove the need for a separate expert-curated pronunciation
lexicon to map from phoneme-based units to words. However, there has been
little previous work comparing phoneme-based versus grapheme-based sub-word
units in the end-to-end modeling framework, to determine whether the gains from
such approaches are primarily due to the new probabilistic model, or from the
joint learning of the various components with grapheme-based units.
In this work, we conduct detailed experiments which are aimed at quantifying
the value of phoneme-based pronunciation lexica in the context of end-to-end
models. We examine phoneme-based end-to-end models, which are contrasted
against grapheme-based ones on a large vocabulary English Voice-search task,
where we find that graphemes do indeed outperform phonemes. We also compare
grapheme and phoneme-based approaches on a multi-dialect English task, which
once again confirm the superiority of graphemes, greatly simplifying the system
for recognizing multiple dialects
- …