End-to-End Spoken Language Understanding using RNN-Transducer ASR
We propose an end-to-end trained spoken language understanding (SLU) system
that extracts transcripts, intents and slots from an input speech utterance. It
consists of a streaming recurrent neural network transducer (RNNT) based
automatic speech recognition (ASR) model connected to a neural natural language
understanding (NLU) model through a neural interface. This interface allows for
end-to-end training using multi-task RNNT and NLU losses. Additionally, we
introduce semantic sequence loss training for the joint RNNT-NLU system that
allows direct optimization of non-differentiable SLU metrics. This end-to-end
SLU model paradigm can leverage state-of-the-art advancements and pretrained
models in both ASR and NLU research communities, outperforming recently
proposed direct speech-to-semantics models, and conventional pipelined ASR and
NLU systems. We show that this method improves both ASR and NLU metrics on both
public SLU datasets and large proprietary datasets.
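The abstract above does not spell out how the semantic sequence loss is formulated. One common way to optimize a non-differentiable metric is to minimize its expected value over an n-best list of hypotheses, and the sketch below illustrates only that general idea. The function name `expected_risk`, the example scores, and the per-hypothesis costs are all hypothetical and are not taken from the paper.

```python
import math


def expected_risk(log_probs, costs):
    """Expected cost over an n-best list (assumed formulation, not the paper's).

    log_probs: model log-scores of the hypotheses in the n-best list.
    costs: a non-differentiable metric per hypothesis, e.g. a count of
           word, intent and slot errors (hypothetical choice).
    """
    # Renormalize the scores over the n-best list with a stable softmax.
    top = max(log_probs)
    weights = [math.exp(lp - top) for lp in log_probs]
    total = sum(weights)
    # Probability-weighted average of the metric.
    return sum(w / total * c for w, c in zip(weights, costs))


# Hypothetical 3-best list: the best-scored hypothesis has no errors.
risk = expected_risk([-1.0, -2.0, -3.0], [0.0, 1.0, 2.0])
print(f"expected risk: {risk:.3f}")
```

In a trainable system the log-scores would come from the model, so the gradient of this quantity flows through the hypothesis probabilities even though the costs themselves are not differentiable.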
Confusion2vec 2.0: Enriching Ambiguous Spoken Language Representations with Subwords
Word vector representations enable machines to encode human language for
spoken language understanding and processing. Confusion2vec, motivated by
human speech production and perception, is a word vector representation which
encodes ambiguities present in human spoken language in addition to semantics
and syntactic information. Confusion2vec provides a robust spoken language
representation by considering inherent human language ambiguities. In this
paper, we propose a novel word vector space estimation by unsupervised learning
on lattices output by an automatic speech recognition (ASR) system. We encode
each word in confusion2vec vector space by its constituent subword character
n-grams. We show the subword encoding helps better represent the acoustic
perceptual ambiguities in human spoken language via information modeled on
lattice structured ASR output. The usefulness of the proposed Confusion2vec
representation is evaluated using semantic, syntactic and acoustic analogy and
word similarity tasks. We also show the benefits of subword modeling for
acoustic ambiguity representation on the task of spoken language intent
detection. The results significantly outperform existing word vector
representations when evaluated on erroneous ASR outputs. We demonstrate that
Confusion2vec subword modeling eliminates the need for retraining/adapting the
natural language understanding models on ASR transcripts.
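To illustrate what encoding a word by its constituent subword character n-grams can look like, here is a minimal sketch. It assumes a fastText-style scheme with `<` and `>` as word boundary markers and an n-gram range of 3 to 6; the abstract states neither detail, and `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, including boundary markers.

    The boundary markers and the default n-gram range are assumptions
    borrowed from fastText-style subword models.
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the marked word.
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


# Acoustically confusable words tend to share some subword units.
see = char_ngrams("see", 3, 3)
sea = char_ngrams("sea", 3, 3)
print(see)                   # ['<se', 'see', 'ee>']
print(set(see) & set(sea))   # {'<se'}
```

In such a scheme a word vector is typically built by combining the vectors of its n-grams, so words with overlapping n-grams receive related representations.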
Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database
This work presents three novel speech recognition architectures evaluated on the Spanish RTVE2020 dataset, which served as the main evaluation set in the Albayzín S2T Transcription Challenge 2020. The main objective was to improve on the systems the authors previously submitted to the challenge, in which their primary system ranked second. The novel systems are based on both DNN-HMM and E2E acoustic models, for which fully supervised and self-supervised learning methods were used. As a result, the new speech recognition engines clearly outperformed the initial systems, lowering the best WER from 19.27 to 17.60, with the new best achieved by the DNN-HMM based system. This work therefore provides a benchmark of the latest acoustic models on a highly challenging dataset and identifies the best-suited ones depending on the expected quality, the available resources and the required latency.
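For readers unfamiliar with the metric, the sketch below computes word error rate (WER) as word-level edit distance divided by reference length, and then the relative reduction implied by the two figures quoted above. It assumes those figures are percentages; the function names and the example sentences are hypothetical, and this is not the challenge's official scoring tool.

```python
def wer(reference, hypothesis):
    """Word error rate in percent; the reference must not be empty."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(old, new):
    """Relative improvement in percent when an error rate drops from old to new."""
    return 100.0 * (old - new) / old


print(wer("el tiempo en madrid", "el tiempo de madrid"))   # 25.0
print(round(relative_reduction(19.27, 17.60), 2))          # 8.67
```

Under that assumption, moving from 19.27 to 17.60 is an absolute gain of 1.67 points and a relative reduction of about 8.7%.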