Dialogue history integration into end-to-end signal-to-concept spoken language understanding systems
This work investigates the embeddings for representing dialog history in
spoken language understanding (SLU) systems. We focus on the scenario when the
semantic information is extracted directly from the speech signal by means of a
single end-to-end neural network model. We propose to integrate dialogue
history into an end-to-end signal-to-concept SLU system. The dialog history is
represented in the form of dialog history embedding vectors (so-called
h-vectors) and is provided as additional information to end-to-end SLU
models in order to improve system performance. The following three types of
h-vectors are proposed and experimentally evaluated in this paper: (1)
supervised-all embeddings, which predict from the last dialog system response
the bag of concepts expected in the user's answer; (2) supervised-freq embeddings,
which predict only a selected set of semantic concepts (corresponding
to the most frequent errors in our experiments); and (3) unsupervised
embeddings. Experiments on the MEDIA corpus for the semantic slot filling task
demonstrate that the proposed h-vectors improve the model performance.
Comment: Accepted for ICASSP 2020 (Submitted: October 21, 2019
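The bag-of-concepts target mentioned for the supervised-all embeddings can be pictured as a multi-hot vector over the concept inventory. The sketch below is only an illustration of that idea: the function name and the concept labels are hypothetical, and the abstract does not specify how the targets are actually built.

```python
def bag_of_concepts(concepts_in_answer, inventory):
    """Return a multi-hot vector marking which concepts occur in a user answer.

    `inventory` is an ordered list of all known concept labels (hypothetical here).
    Repeated concepts count once, since a bag-of-concepts only records presence.
    """
    present = set(concepts_in_answer)
    unknown = present - set(inventory)
    if unknown:
        raise ValueError(f"concepts not in inventory: {sorted(unknown)}")
    return [1 if concept in present else 0 for concept in inventory]


# Hypothetical concept labels, used for illustration only.
inventory = ["city", "date", "room-count", "hotel-name"]
target = bag_of_concepts(["date", "city", "date"], inventory)
```

A network trained to predict such a vector from the last system response would expose a hidden layer that could serve as the history embedding; that reading is an assumption based on the abstract alone.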
A Data Efficient End-To-End Spoken Language Understanding Architecture
End-to-end architectures have been recently proposed for spoken language
understanding (SLU) and semantic parsing. Based on a large amount of data,
those models jointly learn acoustic and linguistic-sequential features. Although
such architectures give very good results for domain, intent and slot
detection, applying them to a more complex semantic chunking and tagging
task is less straightforward. For that reason, models are in many cases combined with an
external language model to enhance their performance.
In this paper we introduce a data efficient system which is trained
end-to-end, with no additional, pre-trained external module. One key feature of
our approach is an incremental training procedure where acoustic, language and
semantic models are trained sequentially one after the other. The proposed
model has a reasonable size and achieves results competitive with the
state of the art while using a small training dataset. In particular, we reach
24.02% Concept Error Rate (CER) on MEDIA/test while training on MEDIA/train
without any additional data.
Comment: Accepted to ICASSP 202
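For readers unfamiliar with the metric, Concept Error Rate is usually computed like word error rate, but over sequences of concept labels. The sketch below assumes that definition (edit distance divided by reference length); the abstract does not spell out the scoring protocol, and the concept labels are hypothetical.

```python
def concept_error_rate(reference, hypothesis):
    """Levenshtein distance between concept sequences, divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one concept")
    # prev[j] is the edit distance between the reference prefix processed so far
    # and the first j hypothesis concepts.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_concept in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_concept in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_concept == hyp_concept else 1)
            deletion = prev[j] + 1       # reference concept missing from hypothesis
            insertion = curr[j - 1] + 1  # extra concept in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical concept sequences: one of four reference concepts is deleted.
reference = ["command", "hotel-name", "date", "room-count"]
hypothesis = ["command", "date", "room-count"]
cer = concept_error_rate(reference, hypothesis)
```

Because insertions are counted, the rate can exceed 1.0 when the hypothesis is much longer than the reference.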
Modeling ASR Ambiguity for Dialogue State Tracking Using Word Confusion Networks
Spoken dialogue systems typically use a list of top-N ASR hypotheses for
inferring the semantic meaning and tracking the state of the dialogue. However,
ASR graphs, such as confusion networks (confnets), provide a compact
representation of a richer hypothesis space than a top-N ASR list. In this
paper, we study the benefits of using confusion networks with a
state-of-the-art neural dialogue state tracker (DST). We encode the
2-dimensional confnet into a 1-dimensional sequence of embeddings using an
attentional confusion network encoder which can be used with any DST system.
Our confnet encoder is plugged into the state-of-the-art 'Global-Locally
Self-Attentive Dialogue State Tracker' (GLAD) model for DST and obtains
significant improvements in both accuracy and inference time compared to using
top-N ASR hypotheses.
Comment: Accepted at Interspeech-202
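The idea of collapsing a 2-dimensional confnet into a 1-dimensional sequence can be sketched as attention pooling over the competing words at each time step. Everything below is an assumption made for illustration: the scoring function (a dot product with a query vector plus the log posterior), the names, and the toy embeddings are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_confnet(confnet, embeddings, query):
    """Collapse each confusion set into one attention-weighted embedding.

    confnet: list of confusion sets, each a list of (word, posterior) pairs
             with posterior > 0.
    embeddings: dict mapping word -> list of floats (same length as query).
    query: attention query vector (a learned parameter in a real model).
    """
    sequence = []
    for arcs in confnet:
        # Assumed scoring: embedding-query similarity biased by the ASR posterior.
        scores = [
            sum(q * x for q, x in zip(query, embeddings[word])) + math.log(posterior)
            for word, posterior in arcs
        ]
        weights = softmax(scores)
        pooled = [
            sum(w * embeddings[word][d] for w, (word, _) in zip(weights, arcs))
            for d in range(len(query))
        ]
        sequence.append(pooled)
    return sequence


# Toy data: two time steps, the first with two competing ASR hypotheses.
embeddings = {"cheap": [1.0, 0.0], "sheep": [0.0, 1.0], "hotel": [1.0, 1.0]}
confnet = [[("cheap", 0.75), ("sheep", 0.25)], [("hotel", 1.0)]]
encoded = encode_confnet(confnet, embeddings, query=[0.0, 0.0])
```

With a zero query the attention weights reduce to the normalized posteriors, which makes the pooling easy to check by hand; a trained query would shift weight toward words that matter for state tracking.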
Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability
We present an end-to-end approach to extract semantic concepts directly from the speech audio signal. To overcome the lack of data available for this spoken language understanding approach, we investigate the use of a transfer learning strategy based on the principles of curriculum learning. This approach allows us to exploit out-of-domain data that can help to prepare a fully neural architecture. Experiments are carried out on the French MEDIA and PORTMEDIA corpora and show that this end-to-end SLU approach reaches the best results ever published on this task. We compare our approach to a classical pipeline approach that uses ASR followed by POS tagging, a lemmatizer, a chunker and other NLP tools to enrich the ASR outputs fed to a text-to-concept SLU system. Last, we explore the promising capacity of our end-to-end SLU approach to address the problem of domain portability.
Learning speech embeddings for speaker adaptation and speech understanding
In recent years, deep neural network models have gained popularity as a modeling approach for many speech processing tasks including automatic speech recognition (ASR) and spoken language understanding (SLU). In this dissertation, there are two main goals. The first goal is to propose modeling approaches in order to learn speaker embeddings for speaker adaptation or to learn semantic speech embeddings. The second goal is to introduce training objectives that achieve fairness for the ASR and SLU problems. In the case of speaker adaptation, we introduce an auxiliary network to an ASR model and learn to simultaneously detect speaker changes and adapt to the speaker in an unsupervised way. We show that this joint model leads to lower error rates as compared to a two-step approach where the signal is segmented into single speaker regions and then fed into an adaptation model. We then reformulate the speaker adaptation problem from a counterfactual fairness point-of-view and introduce objective functions to match the ASR performance of the individuals in the dataset to that of their counterfactual counterparts. We show that we can achieve lower error rate in an ASR system while reducing the performance disparity between protected groups.
In the second half of the dissertation, we focus on SLU and tackle two problems associated with SLU datasets. The first SLU problem is the lack of large speech corpora. To handle this issue, we propose to use available non-parallel text data so that we can leverage the information in text to guide learning of the speech embeddings. We show that this technique increases the intent classification accuracy as compared to a speech-only system. The second SLU problem is the label imbalance problem in the datasets, which is also related to fairness since a model trained on skewed data usually leads to biased results. To achieve fair SLU, we propose to maximize the F-measure instead of conventional cross-entropy minimization and show that it is possible to increase the number of classes with nonzero recall. In the last two chapters, we provide additional discussions on the impact of these projects from both technical and social perspectives, propose directions for future research and summarize the findings.
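Maximizing the F-measure directly requires a differentiable surrogate, since the usual F-measure is built from hard counts. One common surrogate replaces the counts with sums of predicted probabilities; the sketch below uses that "soft" macro-F1, which is an assumption and not necessarily the formulation used in the dissertation.

```python
def soft_f1_loss(probs, labels, num_classes, eps=1e-8):
    """Return 1 minus the macro-averaged soft F1 score.

    probs: list of per-example probability vectors (one entry per class).
    labels: list of integer class indices, same length as probs.
    Soft counts replace hard true/false positives so that the score varies
    smoothly with the predicted probabilities.
    """
    f1_sum = 0.0
    for c in range(num_classes):
        tp = fp = fn = 0.0
        for p, y in zip(probs, labels):
            is_c = 1.0 if y == c else 0.0
            tp += p[c] * is_c                # probability mass correctly on class c
            fp += p[c] * (1.0 - is_c)        # mass on class c for other examples
            fn += (1.0 - p[c]) * is_c        # mass missing from class c examples
        f1_sum += 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1_sum / num_classes


perfect = soft_f1_loss([[1.0, 0.0], [0.0, 1.0]], labels=[0, 1], num_classes=2)
wrong = soft_f1_loss([[0.0, 1.0], [1.0, 0.0]], labels=[0, 1], num_classes=2)
```

Because every class contributes equally to the macro average, a rare class that is never predicted pulls the score down as much as a frequent one, which is the property that connects this objective to the nonzero-recall goal described above.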