4 research outputs found
An online sequence-to-sequence model for noisy speech recognition
Generative models have long been the dominant approach for speech
recognition. The success of these models, however, relies on the use of
sophisticated recipes and complicated machinery that is not easily accessible
to non-practitioners. Recent innovations in Deep Learning have given rise to an
alternative - discriminative models called Sequence-to-Sequence models, that
can almost match the accuracy of state of the art generative models. While
these models are easy to train as they can be trained end-to-end in a single
step, they have a practical limitation that they can only be used for offline
recognition. This is because the models require that the entirety of the input
sequence be available at the beginning of inference, an assumption that is not
valid for instantaneous speech recognition. To address this problem, online
sequence-to-sequence models were recently introduced. These models are able to
start producing outputs as data arrives and the model feels confident enough
to output partial transcripts. These models, like sequence-to-sequence models,
are causal - the output produced by the model up to any time t affects the
features that are computed subsequently. This makes the model inherently more
powerful than generative models that are unable to change features that are
computed from the data. This paper highlights two main contributions - an
improvement to online sequence-to-sequence model training, and its application
to noisy settings with mixed speech from two speakers.
Comment: arXiv admin note: substantial text overlap with arXiv:1608.0128
Exploring Neural Transducers for End-to-End Speech Recognition
In this work, we perform an empirical comparison among the CTC,
RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech
recognition. We show that, without any language model, Seq2Seq and
RNN-Transducer models both outperform the best reported CTC models with a
language model, on the popular Hub5'00 benchmark. On our internal diverse
dataset, these trends continue - RNN-Transducer models rescored with a language
model after beam search outperform our best CTC models. These results simplify
the speech recognition pipeline so that decoding can now be expressed purely as
neural network operations. We also study how the choice of encoder architecture
affects the performance of the three models - when all encoder layers are
forward only, and when encoders downsample the input representation
aggressively.
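As a rough illustration of the CTC output convention that the comparison above builds on, the sketch below collapses a per-frame label path into a label sequence by merging consecutive repeats and dropping blanks. The blank index `0` and the function name `ctc_greedy_collapse` are assumptions made for this example; it is not the decoder used in the paper, which also evaluates beam search and language-model rescoring.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label path into an output label sequence.

    Consecutive repeats are merged first, then blanks are removed, so a
    blank between two identical labels keeps them as separate tokens.
    """
    return [label for label, _ in groupby(frame_labels) if label != blank]


# Per-frame argmax labels for a toy 11-frame utterance.
path = [0, 3, 3, 0, 0, 5, 5, 5, 0, 5, 0]
print(ctc_greedy_collapse(path))  # [3, 5, 5]
```

The blank between the two runs of label 5 is what allows the repeated token to survive the collapse.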
Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition
This article describes an efficient training method for online streaming
attention-based encoder-decoder (AED) automatic speech recognition (ASR)
systems. AED models have achieved competitive performance in offline scenarios
by jointly optimizing all components. They have recently been extended to an
online streaming framework via models such as monotonic chunkwise attention
(MoChA). However, the elaborate attention calculation process is not robust for
long-form speech utterances. Moreover, the sequence-level training objective
and time-restricted streaming encoder cause a nonnegligible delay in token
emission during inference. To address these problems, we propose CTC
synchronous training (CTC-ST), in which CTC alignments are leveraged as a
reference for token boundaries to enable a MoChA model to learn optimal
monotonic input-output alignments. We formulate a purely end-to-end training
objective to synchronize the boundaries of MoChA to those of CTC. The CTC model
shares an encoder with the MoChA model to enhance the encoder representation.
Moreover, the proposed method provides alignment information learned in the CTC
branch to the attention-based decoder. Therefore, CTC-ST can be regarded as
self-distillation of alignment knowledge from CTC to MoChA. Experimental
evaluations on a variety of benchmark datasets show that the proposed method
significantly reduces recognition errors and emission latency simultaneously,
especially for long-form and noisy speech. We also compare CTC-ST with several
methods that distill alignment knowledge from a hybrid ASR system and show that
the CTC-ST can achieve a comparable tradeoff of accuracy and latency without
relying on external alignment information. The best MoChA system shows
performance comparable to that of RNN-transducer (RNN-T).
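To make the idea of using CTC alignments as a reference for token boundaries more concrete, here is a minimal sketch under stated assumptions. It takes the first frame of each non-blank run in a frame-level CTC path as that token's boundary, and measures the mean absolute gap between two boundary lists. Both choices, and the names `token_boundaries` and `mean_boundary_gap`, are illustrative assumptions; the actual CTC-ST objective is defined in the article and may differ.

```python
def token_boundaries(frame_labels, blank=0):
    """Return (label, frame_index) pairs, one per emitted token.

    Assumption for this sketch: a token's boundary is the first frame of
    its non-blank run in the frame-level path.
    """
    boundaries = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank and label != prev:
            boundaries.append((label, t))
        prev = label
    return boundaries


def mean_boundary_gap(ref_frames, hyp_frames):
    """Mean absolute frame distance between two equal-length boundary lists."""
    if len(ref_frames) != len(hyp_frames):
        raise ValueError("boundary lists must have the same length")
    if not ref_frames:
        return 0.0
    total = sum(abs(r - h) for r, h in zip(ref_frames, hyp_frames))
    return total / len(ref_frames)


# Toy CTC path and hypothetical attention-decoder emission frames.
ctc_path = [0, 3, 3, 0, 0, 5, 5, 5, 0, 5, 0]
ctc_frames = [t for _, t in token_boundaries(ctc_path)]  # [1, 5, 9]
decoder_frames = [3, 8, 10]  # made-up values for illustration only
print(mean_boundary_gap(ctc_frames, decoder_frames))  # 2.0
```

A smaller gap means the decoder emits tokens closer to the CTC reference positions, which is the intuition behind synchronizing the two sets of boundaries.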
Deep Learning Based Chatbot Models
A conversational agent (chatbot) is a piece of software that is able to
communicate with humans using natural language. Modeling conversation is an
important task in natural language processing and artificial intelligence.
While chatbots can be used for various tasks, in general they have to
understand users' utterances and provide responses that are relevant to the
problem at hand.
In my work, I conduct an in-depth survey of recent literature, examining over
70 publications related to chatbots published in the last 3 years. Then, I
proceed to make the argument that the very nature of the general conversation
domain demands approaches that are different from current state-of-the-art
architectures. Based on several examples from the literature I show why current
chatbot models fail to take into account enough priors when generating
responses and how this affects the quality of the conversation. In the case of
chatbots, these priors can be outside sources of information that the
conversation is conditioned on, such as the persona or mood of the conversers. In
addition to presenting the reasons behind this problem, I propose several ideas
on how it could be remedied.
The next section focuses on adapting the very recent Transformer model, which
is currently state-of-the-art in neural machine translation, to the chatbot
domain. I first present experiments with the vanilla model, using
conversations extracted from the Cornell Movie-Dialog Corpus. Secondly, I
augment the model with some of my ideas regarding the issues of encoder-decoder
architectures. More specifically, I feed additional features, such as mood or
persona, into the model together with the raw conversation data. Finally, I
conduct a detailed analysis of how the vanilla model performs on conversational
data by comparing it to previous chatbot models and how the additional features
affect the quality of the generated responses.
Comment: 67 pages. Written in October of 2017 for a university conference. In
April of 2019, it won first place at the Hungarian Scientific Students'
Associations Report, which is a national competition-like conference for
student
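One simple way to feed features such as persona or mood into an encoder-decoder model together with the raw conversation data is to prepend them to the input as special tokens. The sketch below shows that idea only. The token format (`<persona:...>`, `<mood:...>`), the whitespace tokenizer, and the function name `build_model_input` are assumptions for illustration and are not taken from the work described above.

```python
def build_model_input(utterance, persona=None, mood=None):
    """Build a token list with optional conditioning tokens in front.

    The special-token format is hypothetical; a real system would add
    these tokens to the model vocabulary.
    """
    tokens = []
    if persona is not None:
        tokens.append(f"<persona:{persona}>")
    if mood is not None:
        tokens.append(f"<mood:{mood}>")
    # Naive lowercase whitespace tokenization, for illustration only.
    tokens.extend(utterance.lower().split())
    return tokens


print(build_model_input("How are you", persona="pirate", mood="happy"))
# ['<persona:pirate>', '<mood:happy>', 'how', 'are', 'you']
```

Because the conditioning tokens sit in the same sequence as the utterance, the model can attend to them like any other input token without architectural changes.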