Search CORE

12,592 research outputs found

The Microsoft 2017 Conversational Speech Recognition System

Author: Alleva F.
Droppo J.
Huang X.
Stolcke A.
Wu L.
Xiong W.
Publication venue
Publication date: 24/08/2017
Field of study

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by a word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1\% word error rate on the 2000 Switchboard evaluation set

arXiv.org e-Print Archive

Crossref

The Microsoft 2016 Conversational Speech Recognition System

Author: Droppo J.
Huang X.
Seide F.
Seltzer M.
Stolcke A.
Xiong W.
Yu D.
Zweig G.
Publication venue
Publication date: 25/01/2017
Field of study

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based system combination provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task

arXiv.org e-Print Archive

Crossref

English Conversational Telephone Speech Recognition by Humans and Machines

Author: Audhkhasi Kartik
Cui Xiaodong
Dimitriadis Dimitrios
Hall Phil
Kurata Gakuto
Lim Lynn-Li
Picheny Michael
Ramabhadran Bhuvana
Roomi Bergul
Saon George
Sercu Tom
Thomas Samuel
Publication venue
Publication date: 06/03/2017
Field of study

One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to the level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which - at least at the writing of this paper - is a new performance milestone (albeit not at what we measure to be human performance!). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models

arXiv.org e-Print Archive

Crossref

Parsing Speech: A Neural Approach to Integrating Lexical and Acoustic-Prosodic Information

Author: Bansal Mohit
Gimpel Kevin
Livescu Karen
Ostendorf Mari
Toshniwal Shubham
Tran Trang
Publication venue
Publication date: 01/01/2018
Field of study

In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses. For automatically parsing spoken utterances, we introduce a model that integrates transcribed text and acoustic-prosodic features using a convolutional neural network over energy and pitch trajectories coupled with an attention-based recurrent neural network that accepts text and prosodic features. We find that different types of acoustic-prosodic features are individually helpful, and together give statistically significant improvements in parse and disfluency detection F1 scores over a strong text-only baseline. For this study with known sentence boundaries, error analyses show that the main benefit of acoustic-prosodic features is in sentences with disfluencies, attachment decisions are most improved, and transcription errors obscure gains from prosody.Comment: Accepted in NAACL HLT 201

arXiv.org e-Print Archive

Crossref

Advances in All-Neural Speech Recognition

Author: Droppo J.
Stolcke A.
Yu C.
Zweig G.
Publication venue
Publication date: 25/01/2017
Field of study

This paper advances the design of CTC-based all-neural (or end-to-end) speech recognizers. We propose a novel symbol inventory, and a novel iterated-CTC method in which a second system is used to transform a noisy initial output into a cleaner version. We present a number of stabilization and initialization methods we have found useful in training these networks. We evaluate our system on the commonly used NIST 2000 conversational telephony test set, and significantly exceed the previously published performance of similar systems, both with and without the use of an external language model and decoding technology

arXiv.org e-Print Archive

Crossref