530 research outputs found
Improving the Performance of Online Neural Transducer Models
Having a sequence-to-sequence model which can operate in an online fashion is
important for streaming applications such as Voice Search. The neural transducer
(NT) is a streaming sequence-to-sequence model, but it has shown a significant
degradation in performance compared to non-streaming models such as Listen,
Attend and Spell (LAS). In this paper, we present various improvements to NT.
Specifically, we look at increasing the window over which NT computes
attention, mainly by looking backwards in time so that the model still remains
online. In addition, we explore initializing an NT model from a LAS-trained
model so that it is guided by a better alignment. Finally, we explore stronger
language modeling, such as using wordpiece models and applying an external LM
during the beam search. On a Voice Search task, we find that with these
improvements we can get NT to match the performance of LAS.
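A minimal sketch of the backward-looking attention window described above, assuming simple dot-product attention and an illustrative `window_size` parameter (neither is from the paper): at each step the model attends only over encoder frames at or before the current position, so no future audio is required.

```python
import torch
import torch.nn.functional as F

def windowed_attention(query, encoder_states, boundary, window_size=10):
    """Attend over the past `window_size` encoder frames ending at `boundary`.

    query: (d,) decoder state; encoder_states: (T, d); boundary: index of the
    most recent frame available to the online model.
    """
    start = max(0, boundary - window_size + 1)
    window = encoder_states[start:boundary + 1]   # past frames only: stays online
    weights = F.softmax(window @ query, dim=0)    # dot-product attention scores
    return weights @ window                       # context vector, shape (d,)

# Toy usage: 50 encoder frames of dimension 8, decoder positioned at frame 30.
enc, q = torch.randn(50, 8), torch.randn(8)
print(windowed_attention(q, enc, boundary=30).shape)  # torch.Size([8])
```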
A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability
In this paper, we introduce our work of building a Streaming Multilingual
Speech Model (SM2), which can transcribe or translate multiple spoken languages
into text in the target language. The backbone of SM2 is the Transformer
Transducer, which has high streaming capability. Instead of human-labeled
speech translation (ST) data, SM2 models are trained using weakly supervised
data generated by converting the transcriptions in speech recognition corpora
with a machine translation service. With 351 thousand hours of anonymized
speech training data from 25 languages, SM2 models achieve comparable or even
better ST quality than some recent popular large-scale non-streaming speech
models. More importantly, we show that SM2 has truly zero-shot capability
when expanding to new target languages, yielding high-quality ST results for
{source-speech, target-text} pairs that are not seen during training.
Comment: submitted to ICASSP 202
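A sketch of the weak-supervision recipe the abstract describes: pair each ASR utterance's audio with a machine translation of its transcript to form pseudo speech-translation data. `mt_translate` is a hypothetical placeholder for an MT service, not an API from the paper.

```python
def mt_translate(text: str, target_lang: str) -> str:
    # Placeholder: a real pipeline would call a machine translation service here.
    return f"[{target_lang}] {text}"

def build_weak_st_corpus(asr_corpus, target_lang):
    """asr_corpus: iterable of (audio_path, transcript) pairs from an ASR corpus.
    Yields (audio_path, pseudo_translation) speech-translation training pairs."""
    for audio_path, transcript in asr_corpus:
        yield audio_path, mt_translate(transcript, target_lang)

# Toy usage: one ASR example converted into a pseudo-ST example for German.
print(list(build_weak_st_corpus([("utt1.wav", "turn on the lights")], "de")))
```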
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies
A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues
Sequential data often possesses a hierarchical structure with complex
dependencies between subsequences, such as that found between the utterances in
a dialogue. In an effort to model this kind of generative process, we propose a
neural network-based generative architecture, with latent stochastic variables
that span a variable number of time steps. We apply the proposed model to the
task of dialogue response generation and compare it with recent neural network
architectures. We evaluate the model performance through automatic evaluation
metrics and by carrying out a human evaluation. The experiments demonstrate
that our model improves upon recently proposed models and that the latent
variables facilitate the generation of long outputs and maintain the context.
Comment: 15 pages, 5 tables, 4 figures
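A compressed sketch of the core idea: one stochastic latent variable is drawn per utterance (so it spans many word-level time steps), conditioned on a context vector summarizing the dialogue so far. Dimensions and the Gaussian prior parameterization are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class UtteranceLatent(nn.Module):
    """Samples one latent z per utterance from a context-conditioned Gaussian prior."""

    def __init__(self, ctx_dim=128, z_dim=32):
        super().__init__()
        self.prior = nn.Linear(ctx_dim, 2 * z_dim)  # predicts mean and log-variance

    def forward(self, context):                      # context: (batch, ctx_dim)
        mu, logvar = self.prior(context).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        return mu + eps * (0.5 * logvar).exp()       # reparameterized sample

# Each sampled z would condition the decoder for the whole next utterance.
z = UtteranceLatent()(torch.randn(4, 128))
print(z.shape)  # torch.Size([4, 32])
```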
The Whole Truth and Nothing But the Truth: Faithful and Controllable Dialogue Response Generation with Dataflow Transduction and Constrained Decoding
In a real-world dialogue system, generated responses must satisfy several
interlocking constraints: being informative, truthful, and easy to control. The
two predominant paradigms in language generation -- neural language modeling
and rule-based generation -- both struggle to satisfy these constraints. Even
the best neural models are prone to hallucination and omission of information,
while existing formalisms for rule-based generation make it difficult to write
grammars that are both flexible and fluent. We describe a hybrid architecture
for dialogue response generation that combines the strengths of both
approaches. This architecture has two components. The first is a rule-based
content selection model, defined using a new formal framework called dataflow
transduction, which uses declarative rules to transduce a dialogue agent's
computations (represented as dataflow graphs) into context-free grammars
representing the space of contextually acceptable responses. The second is a
constrained decoding procedure that uses these grammars to constrain the output
of a neural language model, which selects fluent utterances. The resulting
system outperforms both rule-based and learned approaches in human evaluations
of fluency, relevance, and truthfulness.
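A minimal sketch of the constrained decoding step: at each position the grammar exposes the set of tokens that keep the partial response inside the space of acceptable outputs, and the language model's distribution is masked to that set. The `allowed_next_tokens` callable stands in for a real CFG-derived recognizer; the shapes and greedy selection are illustrative assumptions.

```python
import torch

def constrained_step(lm_logits, prefix, allowed_next_tokens):
    """Pick the highest-scoring next token among those the grammar permits.

    lm_logits: (vocab,) next-token scores from the LM; prefix: token ids so far;
    allowed_next_tokens: maps a prefix to the set of grammar-legal token ids.
    """
    allowed = torch.tensor(sorted(allowed_next_tokens(prefix)))
    mask = torch.full_like(lm_logits, float("-inf"))
    mask[allowed] = 0.0                      # disallowed tokens score -inf
    return int(torch.argmax(lm_logits + mask))

# Toy usage: the grammar only permits tokens 2, 5, and 7 after an empty prefix.
print(constrained_step(torch.randn(10), [], lambda p: {2, 5, 7}))
```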
Dialog act guided contextual adapter for personalized speech recognition
Personalization in multi-turn dialogs has been a long-standing challenge for
end-to-end automatic speech recognition (E2E ASR) models. Recent work on
contextual adapters has tackled rare word recognition using user catalogs. This
adaptation, however, does not incorporate an important cue, the dialog act,
which is available in a multi-turn dialog scenario. In this work, we propose a
dialog act guided contextual adapter network. Specifically, it leverages dialog
acts to select the most relevant user catalogs and creates queries based on
both the audio and the semantic relationship between the carrier phrase and
user catalogs to better guide the contextual biasing. On industrial voice
assistant datasets, our model outperforms both baselines -- a dialog act
encoder-only model and the prior contextual adapter -- achieving a 58% average
relative word error rate reduction (WERR) over the no-context model in the
multi-turn dialog scenario, compared to the 39% WERR achieved by the prior-art
contextual adapter.
Comment: Accepted at ICASSP 202
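A sketch of the biasing mechanism the abstract outlines: a dialog-act embedding scores the user catalogs, the most relevant catalog's phrase embeddings become the attention keys and values, and the acoustic query attends over them to produce a biasing vector. All shapes and the catalog-selection rule are illustrative assumptions, not the proposed network's exact design.

```python
import torch
import torch.nn.functional as F

def dialog_act_biasing(audio_query, dialog_act, catalogs):
    """audio_query, dialog_act: (d,); catalogs: list of (N_i, d) phrase embeddings.
    Returns a (d,) biasing context vector for the ASR model."""
    # Score each catalog by the similarity of its mean embedding to the dialog act.
    scores = torch.stack([c.mean(dim=0) @ dialog_act for c in catalogs])
    best = catalogs[int(torch.argmax(scores))]     # most relevant user catalog
    attn = F.softmax(best @ audio_query, dim=0)    # attend with the audio-based query
    return attn @ best

# Toy usage: two catalogs of 5 and 3 phrases with embedding dimension 8.
cats = [torch.randn(5, 8), torch.randn(3, 8)]
print(dialog_act_biasing(torch.randn(8), torch.randn(8), cats).shape)
```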
THE CHILD AND THE WORLD: How Children acquire Language
Over the last few decades, research into child language acquisition has been revolutionized by the use of ingenious new techniques which allow one to investigate what infants (that is, children not yet able to speak) can in fact perceive when exposed to a stream of speech sound: the discriminations they can make between different speech sounds, different speech sound sequences and different words. However, on the central features of the mystery, the extraordinarily rapid acquisition of lexicon and complex syntactic structures, little solid progress has been made. The questions being researched are how infants acquire and produce the speech sounds (phonemes) of the community language; how infants find words in the stream of speech; and how they link words to perceived objects or actions, that is, discover meanings. In a recent general review in Nature of children's language acquisition, Patricia Kuhl also asked why we do not learn new languages as easily at 50 as at 5, and why computers have not cracked the human linguistic code. The motor theory of language function and origin makes possible a plausible account of child language acquisition generally, from which answers can also be derived to these further questions.

Why computers have so far been unable to 'crack' the language problem becomes apparent in the light of the motor theory account: computers can have no natural relation between words and their meanings; they have no conceptual store to which the network of words is linked, nor do they have the innate aspects of language functioning represented by function words; computers have no direct links between speech sounds and movement patterns, and they do not have the instantly integrated neural patterning underlying thought -- they necessarily operate serially and hierarchically. Adults find the acquisition of a new language much more difficult than children do because they are already neurally committed to the link between the words of their first language and the elements in their conceptual store. A second language being acquired by an adult is in direct competition for neural space with the network structures established for the first language.
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem -- unconstrained natural language sentences,
and in-the-wild videos. Our key contributions are: (1) we compare two models
for lip reading, one using a CTC loss, and the other using a
sequence-to-sequence loss. Both models are built on top of the transformer
self-attention architecture; (2) we investigate to what extent lip reading is
complementary to audio speech recognition, especially when the audio signal is
noisy; (3) we introduce and publicly release a new dataset for audio-visual
speech recognition, LRS2-BBC, consisting of thousands of natural sentences from
British television. The models that we train surpass the performance of all
previous work on a lip reading benchmark dataset by a significant margin.
Comment: Accepted for publication by IEEE Transactions on Pattern Analysis and Machine Intelligence
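A toy contrast of the two training losses the paper compares on top of a shared transformer encoder: CTC over per-frame logits versus sequence-to-sequence cross-entropy over decoder outputs. The tensors are random stand-ins; the real models consume lip-region video features.

```python
import torch
import torch.nn.functional as F

T, B, V, U = 40, 2, 30, 8   # frames, batch, vocab size (0 = CTC blank), target length
targets = torch.randint(1, V, (B, U))               # reference token sequences

# CTC: per-frame log-probabilities from the encoder, aligned by marginalization.
frame_log_probs = torch.randn(T, B, V).log_softmax(-1)
ctc = F.ctc_loss(frame_log_probs, targets,
                 input_lengths=torch.full((B,), T),
                 target_lengths=torch.full((B,), U), blank=0)

# Seq2seq: per-target-token logits from an attention decoder, scored directly.
dec_logits = torch.randn(B, U, V)
s2s = F.cross_entropy(dec_logits.reshape(-1, V), targets.reshape(-1))
print(float(ctc), float(s2s))
```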