Comparing Human and Machine Errors in Conversational Speech Transcription
Recent work in automatic recognition of conversational telephone speech (CTS)
has achieved accuracy levels comparable to human transcribers, although there
is some debate about how to precisely quantify human performance on this task, using
the NIST 2000 CTS evaluation set. This raises the question of what systematic
differences, if any, may be found differentiating human from machine
transcription errors. In this paper we approach this question by comparing the
output of our most accurate CTS recognition system to that of a standard speech
transcription vendor pipeline. We find that the most frequent substitution,
deletion and insertion error types of both outputs show a high degree of
overlap. The only notable exception is that the automatic recognizer tends to
confuse filled pauses ("uh") and backchannel acknowledgments ("uhhuh"). Humans
tend not to make this error, presumably due to the distinctive and opposing
pragmatic functions attached to these words. Furthermore, we quantify the
correlation between human and machine errors at the speaker level, and
investigate the effect of speaker overlap between training and test data.
Finally, we report on an informal "Turing test" asking humans to discriminate
between automatic and human transcription error cases.
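As a minimal sketch of how the substitution, deletion and insertion counts discussed in the abstract above can be obtained, the Python below aligns a reference and a hypothesis transcript with a standard minimum-edit-distance alignment. The function names and the example sentences are illustrative assumptions, not taken from the paper, and real scoring tools apply text normalization that is omitted here.

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cell[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    cell = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)  # only deletions are possible
    for j in range(1, m + 1):
        cell[0][j] = (j, 0, 0, j)  # only insertions are possible
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, k)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, k)  # substitution
            t, s, d, k = cell[i - 1][j]
            dele = (t + 1, s, d + 1, k)      # reference word missing in hypothesis
            t, s, d, k = cell[i][j - 1]
            ins = (t + 1, s, d, k + 1)       # extra word in hypothesis
            # Tuples compare on total edits first; ties fall back to the other fields.
            cell[i][j] = min(diag, dele, ins)
    _, subs, dels, inss = cell[n][m]
    return subs, dels, inss


def word_error_rate(ref: List[str], hyp: List[str]) -> float:
    """WER = (S + D + I) / number of reference words."""
    subs, dels, inss = align_counts(ref, hyp)
    return (subs + dels + inss) / len(ref)


# Hypothetical example: a filled pause transcribed as a backchannel.
reference = "uh i think so".split()
hypothesis = "uhhuh i think".split()
counts = align_counts(reference, hypothesis)
rate = word_error_rate(reference, hypothesis)
```

Here the "uh"/"uhhuh" confusion is counted as one substitution and the missing "so" as one deletion, so two errors against four reference words.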
The Microsoft 2017 Conversational Speech Recognition System
We describe the 2017 version of Microsoft's conversational speech recognition
system, in which we update our 2016 system with recent developments in
neural-network-based acoustic and language modeling to further advance the
state of the art on the Switchboard speech recognition task. The system adds a
CNN-BLSTM acoustic model to the set of model architectures we combined
previously, and includes character-based and dialog session aware LSTM language
models in rescoring. For system combination we adopt a two-stage approach,
whereby subsets of acoustic models are first combined at the senone/frame
level, followed by a word-level voting via confusion networks. We also added a
confusion network rescoring step after system combination. The resulting system
yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
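The two-stage combination described in the abstract above can be illustrated with a heavily simplified sketch. It assumes equal model weights at the frame level and assumes that word hypotheses have already been aligned into slots, with an empty string marking a skipped slot; a real confusion network carries word posteriors and is built by an alignment step not shown here. All names and data are hypothetical and this is not the authors' implementation.

```python
from collections import Counter
from typing import List, Sequence


def fuse_frame_posteriors(
    models: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Stage 1 (sketch): average per-frame senone posteriors across models."""
    n_models = len(models)
    n_frames = len(models[0])
    n_senones = len(models[0][0])
    return [
        [sum(m[t][s] for m in models) / n_models for s in range(n_senones)]
        for t in range(n_frames)
    ]


def vote_words(aligned_hyps: Sequence[Sequence[str]]) -> List[str]:
    """Stage 2 (sketch): majority vote per pre-aligned slot; '' means no word."""
    output = []
    for slot in zip(*aligned_hyps):
        word, _count = Counter(slot).most_common(1)[0]
        if word:  # drop slots where the empty entry wins
            output.append(word)
    return output


# Two hypothetical acoustic models, two frames, two senones each.
model_a = [[0.75, 0.25], [0.5, 0.5]]
model_b = [[0.25, 0.75], [0.5, 0.5]]
fused = fuse_frame_posteriors([model_a, model_b])

# Three hypothetical system outputs, already aligned slot by slot.
hyps = [
    ["i", "uh", "think"],
    ["i", "", "think"],
    ["i", "uh", "thing"],
]
combined = vote_words(hyps)
```

Plain majority voting is the simplest stand-in for word-level voting; weighting each vote by a word posterior would be closer to confusion-network combination, but needs scores that this toy example does not have.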
English Broadcast News Speech Recognition by Humans and Machines
With recent advances in deep learning, considerable attention has been given
to achieving automatic speech recognition performance close to human
performance on tasks like conversational telephone speech (CTS) recognition. In
this paper we evaluate the usefulness of these proposed techniques on broadcast
news (BN), a similarly challenging task. We also perform a set of recognition
measurements to understand how close the achieved automatic speech recognition
results are to human performance on this task. On two publicly available BN
test sets, DEV04F and RT04, our speech recognition system using LSTM and
residual network based acoustic models with a combination of n-gram and neural
network language models performs at 6.5% and 5.9% word error rate. By achieving
new performance milestones on these test sets, our experiments show that
techniques developed on other related tasks, like CTS, can be transferred to
achieve similar performance. In contrast, the best measured human word error
rate on these test sets is much lower, at 3.6% and 2.8% respectively,
indicating that there is still room for new techniques and improvements in this
space, to reach human performance levels.
Comment: © 2019 IEEE.
The Microsoft 2016 Conversational Speech Recognition System
We describe Microsoft's conversational speech recognition system, in which we
combine recent developments in neural-network-based acoustic and language
modeling to advance the state of the art on the Switchboard recognition task.
Inspired by machine learning ensemble techniques, the system uses a range of
convolutional and recurrent neural networks. I-vector modeling and lattice-free
MMI training provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward running RNNLMs, and
word posterior-based system combination provide a 20% boost. The best single
system uses a ResNet architecture acoustic model with RNNLM rescoring, and
achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The
combined system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
English Conversational Telephone Speech Recognition by Humans and Machines
One of the most difficult speech recognition tasks is accurate recognition of
human to human communication. Advances in deep learning over the last few years
have produced major speech recognition improvements on the representative
Switchboard conversational corpus. Word error rates that just a few years ago
were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now
believed to be within striking range of human performance. This then raises two
issues - what IS human performance, and how far down can we still drive speech
recognition error rates? A recent paper by Microsoft suggests that we have
already achieved human performance. In trying to verify this statement, we
performed an independent set of human performance measurements on two
conversational tasks and found that human performance may be considerably
better than what was earlier reported, giving the community a significantly
harder goal to achieve. We also report on our own efforts in this area,
presenting a set of acoustic and language modeling techniques that lowered the
word error rate of our own English conversational telephone LVCSR system to the
level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000
evaluation, which - at least at the writing of this paper - is a new
performance milestone (albeit not at what we measure to be human performance!).
On the acoustic side, we use a score fusion of three models: one LSTM with
multiple feature inputs, a second LSTM trained with speaker-adversarial
multi-task learning and a third residual net (ResNet) with 25 convolutional
layers and time-dilated convolutions. On the language modeling side, we use
word and character LSTMs and convolutional WaveNet-style language models.
Building competitive direct acoustics-to-word models for English conversational speech recognition
Direct acoustics-to-word (A2W) models in the end-to-end paradigm have
received increasing attention compared to conventional sub-word based automatic
speech recognition models using phones, characters, or context-dependent hidden
Markov model states. This is because A2W models recognize words from speech
without any decoder, pronunciation lexicon, or externally-trained language
model, making training and decoding with such models simple. Prior work has
shown that A2W models require orders of magnitude more training data in order
to perform comparably to conventional models. Our work also showed this
accuracy gap when using the English Switchboard-Fisher data set. This paper
describes a recipe to train an A2W model that closes this gap and is on par
with state-of-the-art sub-word based models. We achieve a word error rate of
8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder
or language model. We find that model initialization, training data order, and
regularization have the most impact on the A2W model performance. Next, we
present a joint word-character A2W model that learns to first spell the word
and then recognize it. This model provides a rich output to the user instead of
simple word hypotheses, making it especially useful in the case of words unseen
or rarely-seen during training.
Comment: Submitted to IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 201
Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation
Conventional automatic speech recognition (ASR) systems trained from
frame-level alignments can easily leverage posterior fusion to improve ASR
accuracy and build a better single model with knowledge distillation.
End-to-end ASR systems trained using the Connectionist Temporal Classification
(CTC) loss do not require frame-level alignment and hence simplify model
training. However, sparse and arbitrary posterior spike timings from CTC models
pose a new set of challenges in posterior fusion from multiple models and
knowledge distillation between CTC models. We propose a method to train a CTC
model so that its spike timings are guided to align with those of a pre-trained
guiding CTC model. As a result, all models that share the same guiding model
have aligned spike timings. We show the advantage of our method in various
scenarios including posterior fusion of CTC models and knowledge distillation
between CTC models with different architectures. With the 300-hour Switchboard
training data, the single word CTC model distilled from multiple models
improved the word error rates to 13.7%/23.1% from 14.9%/24.1% on the Hub5 2000
Switchboard/CallHome test sets without using any data augmentation, language
model, or complex decoder.
Comment: Accepted to Interspeech 201
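To put the word error rates quoted in the abstract above into perspective, the short calculation below converts the reported before/after figures into relative reductions. The helper name is an assumption for illustration; only the four percentages come from the abstract.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction, as a fraction of the baseline error rate."""
    return (baseline - improved) / baseline


# Figures quoted in the abstract: 14.9% -> 13.7% (Switchboard),
# 24.1% -> 23.1% (CallHome).
swb_gain = relative_reduction(14.9, 13.7)
ch_gain = relative_reduction(24.1, 23.1)

print(f"Switchboard: {swb_gain:.1%} relative")
print(f"CallHome:    {ch_gain:.1%} relative")
```

This works out to a relative reduction of roughly 8% on Switchboard and roughly 4% on CallHome.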