ConfNet2Seq: Full Length Answer Generation from Spoken Questions
Conversational and task-oriented dialogue systems aim to interact with the
user using natural responses through multi-modal interfaces, such as text or
speech. These desired responses are in the form of full-length natural answers
generated over facts retrieved from a knowledge source. While the task of
generating natural answers to questions from an answer span has been widely
studied, there has been little research on natural sentence generation over
spoken content. We propose a novel system to generate full length natural
language answers from spoken questions and factoid answers. The spoken sequence
is compactly represented as a confusion network extracted from a pre-trained
Automatic Speech Recognizer. This is the first attempt towards generating
full-length natural answers from a graph input (confusion network) to the best
of our knowledge. We release a large-scale dataset of 259,788 samples of spoken
questions, their factoid answers and corresponding full-length textual answers.
With our proposed approach, we achieve performance comparable to that of the
best ASR hypothesis.
Comment: Accepted at Text, Speech and Dialogue, 202
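For readers unfamiliar with the baseline mentioned in the abstract above, the following is a minimal sketch of how a best (1-best) hypothesis can be read off a confusion network by taking the highest-posterior arc in each bin. The data layout, the `<eps>` skip symbol, the `best_hypothesis` helper, and all posteriors are assumptions made for illustration; they are not taken from the paper or its dataset.

```python
# A confusion network sketched as a list of bins; each bin holds
# (word, posterior) arcs. "<eps>" marks a skip arc (assumed convention).
# All words and posteriors below are invented for illustration.
confnet = [
    [("who", 0.7), ("how", 0.3)],
    [("wrote", 0.6), ("rode", 0.3), ("<eps>", 0.1)],
    [("<eps>", 0.55), ("the", 0.45)],
    [("hamlet", 0.8), ("omelet", 0.2)],
]


def best_hypothesis(net):
    """Pick the highest-posterior arc in every bin and drop skip arcs."""
    words = []
    for arcs in net:
        word, _ = max(arcs, key=lambda arc: arc[1])
        if word != "<eps>":
            words.append(word)
    return words


hypothesis = best_hypothesis(confnet)
print(" ".join(hypothesis))
```

The 1-best path discards every competing arc, which is the information a graph-input model would instead keep.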
Modeling ASR Ambiguity for Dialogue State Tracking Using Word Confusion Networks
Spoken dialogue systems typically use a list of top-N ASR hypotheses for
inferring the semantic meaning and tracking the state of the dialogue. However,
ASR graphs, such as confusion networks (confnets), provide a compact
representation of a richer hypothesis space than a top-N ASR list. In this
paper, we study the benefits of using confusion networks with a
state-of-the-art neural dialogue state tracker (DST). We encode the
2-dimensional confnet into a 1-dimensional sequence of embeddings using an
attentional confusion network encoder which can be used with any DST system.
Our confnet encoder is plugged into the state-of-the-art 'Global-locally
Self-Attentive Dialogue State Tracker' (GLAD) model for DST and obtains
significant improvements in both accuracy and inference time compared to using
top-N ASR hypotheses.
Comment: Accepted at Interspeech-202
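The idea of collapsing a 2-dimensional confnet into a 1-dimensional sequence of embeddings can be sketched roughly as below: each bin of competing words is pooled into a single vector by attention over its arcs. This is only an illustrative assumption, not the encoder from the paper; the scoring rule (dot product with a query vector plus log posterior), the `encode_confnet` helper, and the toy embeddings are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_confnet(net, embed, query):
    """Pool each bin of (word, posterior) arcs into one vector.

    Assumed scoring rule: dot(query, embedding) + log(posterior).
    Posteriors must be strictly positive.
    """
    sequence = []
    for arcs in net:
        scores = [
            sum(q * x for q, x in zip(query, embed[word])) + math.log(post)
            for word, post in arcs
        ]
        weights = softmax(scores)
        pooled = [
            sum(w * embed[word][i] for w, (word, _) in zip(weights, arcs))
            for i in range(len(query))
        ]
        sequence.append(pooled)
    return sequence


# Toy 2-dimensional embeddings and a two-bin confnet (invented values).
embed = {"cheap": [1.0, 0.0], "chip": [0.0, 1.0], "food": [1.0, 1.0]}
net = [[("cheap", 0.6), ("chip", 0.4)], [("food", 1.0)]]

# With a zero query the attention weights reduce to the ASR posteriors.
encoded = encode_confnet(net, embed, [0.0, 0.0])
print(encoded)
```

The output has one vector per bin, so it can be consumed by any model that expects an ordinary embedding sequence.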
Domain-Aware Dialogue State Tracker for Multi-Domain Dialogue Systems
In task-oriented dialogue systems the dialogue state tracker (DST) component
is responsible for predicting the state of the dialogue based on the dialogue
history. Current DST approaches rely on a predefined domain ontology, which
limits their effective use in large-scale conversational agents, where
the DST constantly needs to be interfaced with ever-increasing services and
APIs. To overcome this drawback, we propose a domain-aware
dialogue state tracker that is completely data-driven and is modeled to
predict for dynamic service schemas. The proposed model utilizes domain and
slot information to extract both domain and slot specific representations for a
given dialogue, and then uses such representations to predict the values of the
corresponding slot. Integrating this mechanism with a pretrained language model
(i.e. BERT), our approach can effectively learn semantic relations
Scalable Neural Dialogue State Tracking
A Dialogue State Tracker (DST) is a key component in a dialogue system aiming
at estimating the beliefs of possible user goals at each dialogue turn. Most of
the current trackers make use of recurrent neural networks and are based on
complex architectures that manage several aspects of a dialogue, including the
user utterance, the system actions, and the slot-value pairs defined in a
domain ontology. However, the complexity of such neural architectures incurs
considerable latency in dialogue state prediction, which limits the
deployment of the models in real-world applications, particularly when task
scalability (i.e., the number of slots) is a crucial factor. In this paper, we
propose an innovative neural model for dialogue state tracking, named Global
encoder and Slot-Attentive decoders (G-SAT), which can predict the dialogue
state with very low latency while maintaining high performance.
We report experiments on three different languages (English, Italian, and
German) of the WoZ2.0 dataset, and show that the proposed approach provides
competitive advantages over state-of-the-art DST systems, both in terms of accuracy
and in terms of time complexity for predictions, being over 15 times faster
than the other systems.
Comment: 8 pages, 3 figures, Accepted at ASRU 201