
    No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models

    For decades, context-dependent phonemes have been the dominant sub-word unit for conventional acoustic modeling systems. This status quo has recently been challenged by end-to-end models, which seek to combine the acoustic, pronunciation, and language model components into a single neural network. Such systems, which typically predict graphemes or words, simplify the recognition process since they remove the need for a separate expert-curated pronunciation lexicon to map from phoneme-based units to words. However, there has been little previous work comparing phoneme-based and grapheme-based sub-word units in the end-to-end modeling framework, to determine whether the gains from such approaches are primarily due to the new probabilistic model or to the joint learning of the various components with grapheme-based units. In this work, we conduct detailed experiments aimed at quantifying the value of phoneme-based pronunciation lexica in the context of end-to-end models. We compare phoneme-based end-to-end models against grapheme-based ones on a large-vocabulary English Voice-search task, where we find that graphemes do indeed outperform phonemes. We also compare grapheme- and phoneme-based approaches on a multi-dialect English task, which once again confirms the superiority of graphemes, greatly simplifying the system for recognizing multiple dialects.
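    The contrast between the two output-unit choices can be made concrete with a small Python sketch (not from the paper; the toy lexicon entries and words are hypothetical): a phoneme-based system must consult an expert-curated lexicon to produce its training targets, while a grapheme-based system reads them straight off the transcript.

        # Hypothetical toy lexicon; real systems rely on large hand-curated ones.
        PRONUNCIATION_LEXICON = {
            "speech": ["s", "p", "iy", "ch"],
            "model": ["m", "aa", "d", "ax", "l"],
        }

        def phoneme_targets(words):
            # Look each word up in the hand-built lexicon; a word with no
            # entry has no training targets at all, which is the maintenance
            # burden the abstract describes.
            targets = []
            for word in words:
                if word not in PRONUNCIATION_LEXICON:
                    raise KeyError(f"no lexicon entry for {word!r}")
                targets.extend(PRONUNCIATION_LEXICON[word])
            return targets

        def grapheme_targets(words):
            # Spell the transcript out character by character; no lexicon needed.
            return list(" ".join(words))

        print(phoneme_targets(["speech", "model"]))
        print(grapheme_targets(["speech", "model"]))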

    Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model

    Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding the separate components of a typical system, namely the acoustic (AM), pronunciation (PM) and language (LM) models, into a single neural network. In this work, we look at one such sequence-to-sequence model, namely listen, attend and spell (LAS), and explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems without the need for separate AMs, PMs and LMs for each dialect. We show that simply pooling the data from all dialects into one LAS model falls behind the performance of a model fine-tuned on each dialect. We then look at incorporating dialect-specific information into the model, both by modifying the training targets, inserting the dialect symbol at the end of the original grapheme sequence, and by feeding a 1-hot representation of the dialect information into all layers of the model. Experimental results on seven English dialects show that our proposed system is effective in modeling dialect variations within a single LAS model, outperforming a LAS model trained individually on each of the seven dialects by 3.1 to 16.5% relative.
    Comment: Submitted to ICASSP 2018.
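    As a rough illustration of the two conditioning mechanisms described above, the Python sketch below appends a dialect symbol to the grapheme targets and builds a 1-hot dialect vector that could be concatenated onto a layer's input. The dialect codes and the 80-dimensional feature frame are illustrative assumptions, not values from the paper.

        import numpy as np

        # Hypothetical dialect inventory (the paper's seven dialects are not
        # enumerated in the abstract).
        DIALECTS = ["en-au", "en-ca", "en-gb", "en-ie", "en-in", "en-us", "en-za"]

        def add_dialect_token(graphemes, dialect):
            # Modified training targets: the dialect symbol is inserted at the
            # end of the original grapheme sequence.
            return list(graphemes) + [f"<{dialect}>"]

        def dialect_feature(dialect):
            # 1-hot representation of the dialect, fed into layers of the model.
            vec = np.zeros(len(DIALECTS), dtype=np.float32)
            vec[DIALECTS.index(dialect)] = 1.0
            return vec

        frame = np.zeros(80, dtype=np.float32)      # stand-in acoustic feature frame
        conditioned = np.concatenate([frame, dialect_feature("en-gb")])

        print(add_dialect_token("hello", "en-gb"))  # [..., 'o', '<en-gb>']
        print(conditioned.shape)                    # (87,)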

    Multilingual Speech Recognition With A Single End-To-End Model

    Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word units, lexicon and word inventories are typically language-specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate the acoustic, pronunciation and language models jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of the language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages.
    Comment: Accepted at ICASSP 2018.
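    A minimal Python sketch of the shared output inventory, assuming a handful of toy transcripts (the language codes and strings are illustrative, not the paper's data): the union of per-language grapheme sets becomes the single model's output vocabulary, and an optional 1-hot language identifier can be supplied as an extra input feature.

        # Toy per-language transcripts in three scripts with little overlap.
        transcripts = {
            "bn": ["নমস্কার"],   # Bengali
            "hi": ["नमस्ते"],     # Hindi
            "ta": ["வணக்கம்"],   # Tamil
        }

        # Union of the language-specific grapheme sets -> one shared inventory.
        grapheme_inventory = sorted({ch
                                     for lines in transcripts.values()
                                     for line in lines
                                     for ch in line})

        languages = sorted(transcripts)

        def language_feature(lang):
            # Optional language identifier, fed as an additional input feature.
            return [1.0 if code == lang else 0.0 for code in languages]

        print(len(grapheme_inventory), "shared output units")
        print(language_feature("ta"))  # [0.0, 0.0, 1.0]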