Dialogue history integration into end-to-end signal-to-concept spoken language understanding systems
This work investigates embeddings for representing dialog history in
spoken language understanding (SLU) systems. We focus on the scenario in which
semantic information is extracted directly from the speech signal by means of a
single end-to-end neural network model. We propose to integrate dialogue
history into an end-to-end signal-to-concept SLU system. The dialog history is
represented in the form of dialog history embedding vectors (so-called
h-vectors) and is provided as additional information to end-to-end SLU
models in order to improve system performance. The following three types of
h-vectors are proposed and experimentally evaluated in this paper: (1)
supervised-all embeddings, which predict the bag of concepts expected in the
user's answer from the last dialog system response; (2) supervised-freq embeddings,
which focus on predicting only a selected set of semantic concepts (corresponding
to the most frequent errors in our experiments); and (3) unsupervised
embeddings. Experiments on the MEDIA corpus for the semantic slot filling task
demonstrate that the proposed h-vectors improve model performance.
Comment: Accepted for ICASSP 2020 (Submitted: October 21, 2019
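As a rough illustration of the bag-of-concepts idea in the abstract above, the sketch below builds a multi-hot target over a concept inventory and appends a fixed-size history vector to every acoustic frame. The concept names, the function names, and the frame-level concatenation are all assumptions made for illustration; the abstract does not say how the h-vectors are fed to the SLU model, and in the paper the h-vector is a learned embedding, not the raw multi-hot target.

```python
from typing import List, Sequence

# Hypothetical concept inventory (assumption); the MEDIA corpus has its own label set.
CONCEPTS = ["command", "location-city", "time-date", "number-room", "response"]


def bag_of_concepts(expected: Sequence[str],
                    inventory: Sequence[str] = CONCEPTS) -> List[float]:
    """Return a multi-hot vector with 1.0 for each concept expected in the answer."""
    present = set(expected)
    unknown = present - set(inventory)
    if unknown:
        raise ValueError(f"concepts not in inventory: {sorted(unknown)}")
    return [1.0 if concept in present else 0.0 for concept in inventory]


def append_history_vector(frames: Sequence[Sequence[float]],
                          h_vector: Sequence[float]) -> List[List[float]]:
    """Concatenate the same history vector to every acoustic feature frame.

    Frame-level concatenation is an assumed integration scheme, not
    necessarily the one used by the authors.
    """
    return [list(frame) + list(h_vector) for frame in frames]


# Toy usage: a target for a system prompt that asks for a date and a city,
# and two 2-dimensional "acoustic" frames extended with a stand-in vector.
target = bag_of_concepts(["time-date", "location-city"])
extended = append_history_vector([[0.1, 0.2], [0.3, 0.4]], target)
```

A predictor trained against targets of this kind could supply its hidden representation as the history vector; here the target itself stands in for that representation purely to keep the example self-contained.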
Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces
We consider the spoken language understanding (SLU) problem of extracting
natural language intents and associated slot arguments or named entities from
speech that is primarily directed at voice assistants. Such a system subsumes
both automatic speech recognition (ASR) and natural language
understanding (NLU). An end-to-end joint SLU model can be built to a required
specification, opening up the opportunity to deploy in hardware-constrained
scenarios, such as devices that enable voice assistants to work offline in a
privacy-preserving manner, while also reducing server costs.
We first present models that extract utterance intent directly from speech
without intermediate text output. We then present a compositional model, which
generates the transcript using the Listen Attend Spell ASR system and then
extracts the interpretation using a neural NLU model. Finally, we contrast these
methods with a jointly trained end-to-end SLU model, consisting of ASR and
NLU subsystems connected by a neural-network-based interface instead
of text, that produces transcripts as well as the NLU interpretation. We show that
the jointly trained model improves ASR by incorporating semantic
information from NLU and also improves NLU by exposing it to ASR confusion
encoded in the hidden layer.
Comment: Proceedings of INTERSPEEC