Streaming End-to-end Speech Recognition For Mobile Devices

Alvarez, Raziel; Bagby, Tom; Bhatia, Deepti; Chang, Shuo-yiin; Gruenstein, Alexander; He, Yanzhang; Kannan, Anjuli; Li, Bo; Liang, Qiao; McGraw, Ian; Pang, Ruoming; Prabhavalkar, Rohit; Pundak, Golan; Rao, Kanishka; Rybach, David; Sainath, Tara N.; Shangguan, Yuan; Sim, Khe Chai; Wu, Yonghui; Zhao, Ding

Publication date: 15 November 2018

Abstract

End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.

Last time updated on 10/08/2021