The Microsoft 2016 Conversational Speech Recognition System
We describe Microsoft's conversational speech recognition system, in which we
combine recent developments in neural-network-based acoustic and language
modeling to advance the state of the art on the Switchboard recognition task.
Inspired by machine learning ensemble techniques, the system uses a range of
convolutional and recurrent neural networks. I-vector modeling and lattice-free
MMI training provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward running RNNLMs, and
word posterior-based system combination provide a 20% boost. The best single
system uses a ResNet architecture acoustic model with RNNLM rescoring, and
achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The
combined system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
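
The error rates quoted above are word error rates (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that metric, not the scoring tool used for the benchmark; the function names are illustrative, and the text normalization applied in official scoring is omitted. It also shows that going from 6.9% to 6.2% is a relative reduction of roughly 10%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Single-system versus combined-system WER from the abstract.
    print(round(relative_reduction(6.9, 6.2), 3))
```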
The Microsoft 2017 Conversational Speech Recognition System
We describe the 2017 version of Microsoft's conversational speech recognition
system, in which we update our 2016 system with recent developments in
neural-network-based acoustic and language modeling to further advance the
state of the art on the Switchboard speech recognition task. The system adds a
CNN-BLSTM acoustic model to the set of model architectures we combined
previously, and includes character-based and dialog session aware LSTM language
models in rescoring. For system combination we adopt a two-stage approach,
whereby subsets of acoustic models are first combined at the senone/frame
level, followed by a word-level voting via confusion networks. We also added a
confusion network rescoring step after system combination. The resulting system
yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
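
The two-stage combination described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not say how frame-level scores are merged, so stage one simply averages senone log-posteriors across models, and stage two replaces real confusion-network alignment with hypotheses that are already aligned slot by slot (an empty string stands for a null arc). The function names are hypothetical.

```python
import math
from collections import defaultdict


def combine_frame_posteriors(model_posteriors):
    """Stage 1 (assumed): average senone log-posteriors per frame, renormalize.

    model_posteriors[m][t][s] is model m's posterior for senone s at frame t.
    All probabilities are assumed to be strictly positive.
    """
    n_models = len(model_posteriors)
    n_frames = len(model_posteriors[0])
    combined = []
    for t in range(n_frames):
        n_senones = len(model_posteriors[0][t])
        log_avg = [
            sum(math.log(m[t][s]) for m in model_posteriors) / n_models
            for s in range(n_senones)
        ]
        unnorm = [math.exp(v) for v in log_avg]
        total = sum(unnorm)
        combined.append([u / total for u in unnorm])
    return combined


def vote_words(aligned_hypotheses):
    """Stage 2 (simplified): per-slot voting weighted by word posteriors.

    aligned_hypotheses[k] is system k's list of (word, posterior) slots;
    all systems are assumed to have the same number of slots.
    """
    result = []
    for slot in zip(*aligned_hypotheses):
        scores = defaultdict(float)
        for word, posterior in slot:
            scores[word] += posterior
        best = max(scores, key=scores.get)
        if best:  # skip slots where the null arc wins
            result.append(best)
    return result


if __name__ == "__main__":
    frames = combine_frame_posteriors([[[0.7, 0.3]], [[0.6, 0.4]]])
    print(frames)
    systems = [
        [("uh", 0.6), ("yes", 0.9), ("", 0.5)],
        [("uhhuh", 0.5), ("yes", 0.8), ("right", 0.4)],
        [("uh", 0.3), ("yeah", 0.7), ("right", 0.45)],
    ]
    print(vote_words(systems))
```

A real system would build the slots by aligning word lattices or N-best lists into confusion networks, and would typically weight each system; both steps are left out here.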
Comparing Human and Machine Errors in Conversational Speech Transcription
Recent work in automatic recognition of conversational telephone speech (CTS)
has achieved accuracy levels comparable to human transcribers, although there
is some debate over how to precisely quantify human performance on this task,
using the NIST 2000 CTS evaluation set. This raises the question of what
systematic differences, if any, distinguish human from machine
transcription errors. In this paper we approach this question by comparing the
output of our most accurate CTS recognition system to that of a standard speech
transcription vendor pipeline. We find that the most frequent substitution,
deletion and insertion error types of both outputs show a high degree of
overlap. The only notable exception is that the automatic recognizer tends to
confuse filled pauses ("uh") and backchannel acknowledgments ("uhhuh"). Humans
tend not to make this error, presumably due to the distinctive and opposing
pragmatic functions attached to these words. Furthermore, we quantify the
correlation between human and machine errors at the speaker level, and
investigate the effect of speaker overlap between training and test data.
Finally, we report on an informal "Turing test" asking humans to discriminate
between automatic and human transcription error cases.
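
Comparing the most frequent substitution, deletion and insertion types, as described above, requires aligning each transcript against a reference and labeling every edit. The sketch below shows one way to do that with a minimum-edit-distance alignment and a backtrace. It is an illustration under assumptions, not the authors' tooling: the function name and the example utterance are invented, and ties between equally cheap alignments are broken arbitrarily (diagonal first).

```python
from collections import Counter


def align_errors(ref, hyp):
    """Return (kind, ref_word, hyp_word) edits from a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the end, recording each non-matching step.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        diag_cost = 0 if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + diag_cost:
            if diag_cost:
                edits.append(("sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            edits.append(("del", ref[i - 1], None))
            i -= 1
        else:
            edits.append(("ins", None, hyp[j - 1]))
            j -= 1
    edits.reverse()
    return edits


if __name__ == "__main__":
    reference = "uhhuh i think so".split()
    machine = "uh i think so too".split()
    counts = Counter(align_errors(reference, machine))
    for edit, count in counts.most_common():
        print(edit, count)
```

Running the same function over human and machine transcripts of a shared test set, and accumulating the counters, gives the two ranked error lists whose overlap can then be compared.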
An End-to-End Conversational Style Matching Agent
We present an end-to-end voice-based conversational agent that is able to
engage in naturalistic multi-turn dialogue and align with the interlocutor's
conversational style. The system uses a series of deep neural network
components for speech recognition, dialogue generation, prosodic analysis and
speech synthesis to generate language and prosodic expression with qualities
that match those of the user. We conducted a user study (N=30) in which
participants talked with the agent for 15 to 20 minutes, resulting in over 8
hours of natural interaction data. Users with high consideration conversational
styles reported the agent to be more trustworthy when it matched their
conversational style, whereas users with high involvement conversational
styles were indifferent. Finally, we provide design guidelines for multi-turn
dialogue interactions using conversational style adaptation.