The Microsoft 2016 Conversational Speech Recognition System
We describe Microsoft's conversational speech recognition system, in which we
combine recent developments in neural-network-based acoustic and language
modeling to advance the state of the art on the Switchboard recognition task.
Inspired by machine learning ensemble techniques, the system uses a range of
convolutional and recurrent neural networks. I-vector modeling and lattice-free
MMI training provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward running RNNLMs, and
word posterior-based system combination provide a 20% boost. The best single
system uses a ResNet architecture acoustic model with RNNLM rescoring, and
achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The
combined system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
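The word posterior-based combination mentioned above can be sketched as a weighted average of per-word posteriors across systems, with the top-scoring word winning at each slot. This is a minimal illustration with made-up hypothesis scores, not the paper's actual confusion-network pipeline:

```python
# Minimal sketch of word posterior-based system combination.
# The word hypotheses and probabilities below are hypothetical.
from collections import defaultdict

def combine_word_posteriors(system_posteriors, weights=None):
    """Average per-system word posteriors for one hypothesis slot
    and return the highest-scoring word with its combined score."""
    if weights is None:
        weights = [1.0 / len(system_posteriors)] * len(system_posteriors)
    combined = defaultdict(float)
    for weight, posteriors in zip(weights, system_posteriors):
        for word, prob in posteriors.items():
            combined[word] += weight * prob
    return max(combined.items(), key=lambda kv: kv[1])

# Two systems partially disagree; combination favors the consensus word.
sys_a = {"speech": 0.6, "peach": 0.4}
sys_b = {"speech": 0.7, "beach": 0.3}
best_word, score = combine_word_posteriors([sys_a, sys_b])
# best_word == "speech", score == 0.65
```

Equal weights are used here for simplicity; in practice combination weights would be tuned on held-out data.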
The Microsoft 2017 Conversational Speech Recognition System
We describe the 2017 version of Microsoft's conversational speech recognition
system, in which we update our 2016 system with recent developments in
neural-network-based acoustic and language modeling to further advance the
state of the art on the Switchboard speech recognition task. The system adds a
CNN-BLSTM acoustic model to the set of model architectures we combined
previously, and includes character-based and dialog session aware LSTM language
models in rescoring. For system combination we adopt a two-stage approach,
whereby subsets of acoustic models are first combined at the senone/frame
level, followed by a word-level voting via confusion networks. We also added a
confusion network rescoring step after system combination. The resulting system
yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
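The first stage of the two-stage combination described above, combining a subset of acoustic models at the senone/frame level, amounts to averaging per-frame senone posteriors across models. A minimal sketch with toy arrays (shapes and values are hypothetical):

```python
# Stage 1 of two-stage combination: frame-level averaging of senone
# posteriors across a subset of acoustic models. Toy data only.
import numpy as np

def combine_frame_posteriors(model_posteriors, weights=None):
    """Weighted average of (frames x senones) posterior matrices
    from several acoustic models."""
    stacked = np.stack(model_posteriors)           # (models, frames, senones)
    if weights is None:
        weights = np.full(len(model_posteriors), 1.0 / len(model_posteriors))
    return np.tensordot(weights, stacked, axes=1)  # (frames, senones)

# Two toy models, 2 frames, 3 senones each.
m1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
m2 = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]])
avg = combine_frame_posteriors([m1, m2])
# avg == [[0.6, 0.25, 0.15], [0.15, 0.7, 0.15]]
```

The second stage, word-level voting via confusion networks, then operates on decodings produced from these combined frame posteriors.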
One Size Does Not Fit All: Quantifying and Exposing the Accuracy-Latency Trade-off in Machine Learning Cloud Service APIs via Tolerance Tiers
Today's cloud service architectures follow a "one size fits all" deployment
strategy where the same service version instantiation is provided to the end
users. However, consumers are broad and different applications have different
accuracy and responsiveness requirements, which as we demonstrate renders the
"one size fits all" approach inefficient in practice. We use a production-grade
speech recognition engine, which serves several thousand users, and an
open-source computer-vision-based system, to illustrate our point. To overcome the
limitations of the "one size fits all" approach, we recommend Tolerance Tiers
where each MLaaS tier exposes an accuracy/responsiveness characteristic, and
consumers can programmatically select a tier. We evaluate our proposal on the
CPU-based automatic speech recognition (ASR) engine and cutting-edge neural
networks for image classification deployed on both CPUs and GPUs. The results
show that our proposed approach provides an MLaaS cloud service architecture
that can be tuned by the end API user or consumer to outperform the
conventional "one size fits all" approach.
Comment: 2019 IEEE International Symposium on Performance Analysis of Systems
and Software (ISPASS)
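The programmatic tier selection that the Tolerance Tiers proposal describes can be sketched as a small API: each tier advertises an accuracy/latency characteristic, and a consumer picks the cheapest tier meeting its constraints. The tier names, accuracy figures, and latency budgets below are hypothetical placeholders, not values from the paper:

```python
# Hypothetical sketch of a Tolerance Tiers selection API.
# Tier names and numbers are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    expected_accuracy: float   # fraction of requests answered correctly
    latency_budget_ms: float   # advertised response-time bound

# Ordered fastest-first so selection returns the cheapest qualifying tier.
TIERS = [
    Tier("fast",     expected_accuracy=0.88, latency_budget_ms=50),
    Tier("balanced", expected_accuracy=0.93, latency_budget_ms=150),
    Tier("accurate", expected_accuracy=0.96, latency_budget_ms=400),
]

def select_tier(min_accuracy, max_latency_ms):
    """Programmatically pick the fastest tier that satisfies both
    the accuracy floor and the latency ceiling."""
    for tier in TIERS:
        if (tier.expected_accuracy >= min_accuracy
                and tier.latency_budget_ms <= max_latency_ms):
            return tier
    raise ValueError("no tier satisfies the requested trade-off")

tier = select_tier(min_accuracy=0.9, max_latency_ms=200)
# tier.name == "balanced"
```

An application that tolerates lower accuracy would instead request, say, `min_accuracy=0.85` and receive the fast tier, which is the trade-off the abstract argues a one-size-fits-all deployment cannot express.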