74 research outputs found
On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
In conventional speech recognition, phoneme-based models outperform
grapheme-based models for non-phonetic languages such as English. The
performance gap between the two typically reduces as the amount of training
data is increased. In this work, we examine the impact of the choice of
modeling unit for attention-based encoder-decoder models. We conduct
experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various
target units (phoneme, grapheme, and word-piece); across all tasks, we find
that grapheme or word-piece models consistently outperform phoneme-based
models, even though they are evaluated without a lexicon or an external
language model. We also investigate model complementarity: we find that we can
improve WERs by up to 9% relative by rescoring N-best lists generated from a
strong word-piece-based baseline with either the phoneme or the grapheme model.
Rescoring an N-best list generated by the phonemic system, however, provides
limited improvements. Further analysis shows that the word-piece-based models
produce more diverse N-best hypotheses, and thus lower oracle WERs, than
phonemic models.
Comment: To appear in the proceedings of INTERSPEECH 201
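The rescoring setup described in this abstract (re-ranking a baseline's N-best list with a second model's scores, and measuring the oracle WER of the list) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the log-linear interpolation weight `lam` and all function names are assumptions.

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # sub/match
    return d[n][m] / max(n, 1)

def rescore(nbest, lam=0.3):
    """Re-rank (hypothesis, baseline_score, second_model_score) triples by
    log-linear interpolation of the two models' scores; lam is a tuning weight."""
    return max(nbest, key=lambda h: (1 - lam) * h[1] + lam * h[2])[0]

def oracle_wer(ref, nbest):
    """Best achievable WER over the N-best list (oracle hypothesis selection)."""
    return min(wer(ref, hyp.split()) for hyp, _, _ in nbest)

ref = "play the next song".split()
nbest = [("play the next son", -1.0, -2.5),
         ("play the next song", -1.2, -1.1),
         ("pay the next song", -1.5, -3.0)]
print(rescore(nbest))          # second model promotes the correct hypothesis
print(oracle_wer(ref, nbest))  # 0.0: the correct transcript is in the list
```

A more diverse N-best list (as the abstract reports for word-piece models) is more likely to contain the correct transcript somewhere in the list, which is exactly what a lower oracle WER measures.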
On a Method for Monitoring the Operability of a Shift Register
A functional diagram is described for a device that monitors the operability of a shift register, based on timing the shift of a single one through the register. The device uses two two-input coincidence circuits and a delay line.
A Study of the Strength of Grey Cast Iron under Radial Compression
Recommendations for Generalizing the Results of Geophysical Surveys at Gold-Ore Deposits of Transbaikalia
No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models
For decades, context-dependent phonemes have been the dominant sub-word unit
for conventional acoustic modeling systems. This status quo has begun to be
challenged recently by end-to-end models which seek to combine acoustic,
pronunciation, and language model components into a single neural network. Such
systems, which typically predict graphemes or words, simplify the recognition
process since they remove the need for a separate expert-curated pronunciation
lexicon to map from phoneme-based units to words. However, there has been
little previous work comparing phoneme-based versus grapheme-based sub-word
units in the end-to-end modeling framework, to determine whether the gains from
such approaches are primarily due to the new probabilistic model, or from the
joint learning of the various components with grapheme-based units.
In this work, we conduct detailed experiments which are aimed at quantifying
the value of phoneme-based pronunciation lexica in the context of end-to-end
models. We examine phoneme-based end-to-end models, which are contrasted
against grapheme-based ones on a large vocabulary English Voice-search task,
where we find that graphemes do indeed outperform phonemes. We also compare
grapheme- and phoneme-based approaches on a multi-dialect English task, which
once again confirms the superiority of graphemes, greatly simplifying the
system for recognizing multiple dialects.
A baseline system for the transcription of Catalan broadcast conversation
The paper describes aspects, methods, and results of the development of an automatic transcription system for Catalan broadcast conversation by means of speech recognition. Emphasis is given to the Catalan language, acoustic and language modelling methods, and recognition. Results are discussed in the context of phenomena and challenges in spontaneous speech, in particular regarding phoneme duration and feature space reduction.
Postprint (published version)