Data Augmentation for Spoken Language Understanding via Pretrained Models
The training of spoken language understanding (SLU) models often faces the
problem of data scarcity. In this paper, we put forward a data augmentation
method with pretrained language models to boost the variability and accuracy of
generated utterances. Furthermore, we investigate and propose solutions to two
previously overlooked scenarios of data scarcity in SLU: i) Rich-in-Ontology:
ontology information with numerous valid dialogue acts is given; ii)
Rich-in-Utterance: a large number of unlabelled utterances are available.
Empirical results show that our method can produce synthetic training data that
boosts the performance of language understanding models in various scenarios.
Comment: 6 pages, 1 figure
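The Rich-in-Ontology scenario above can be sketched in miniature: dialogue acts from the ontology are fed to a generator, and each generated utterance is paired with the act that produced it. In this sketch `pretrained_lm_generate` is a hypothetical stand-in for sampling from a fine-tuned pretrained language model; a real system would decode from the model rather than look up templates.

```python
import random

def pretrained_lm_generate(dialogue_act, n):
    # Hypothetical stand-in for a fine-tuned pretrained LM conditioned on a
    # dialogue act; a real system would decode utterances from the model.
    templates = {
        ("inform", "food", "italian"): [
            "i want italian food",
            "find me an italian restaurant",
            "something italian would be nice",
        ],
    }
    return random.sample(templates[dialogue_act], n)

def augment(ontology_acts, n_per_act=2):
    """Rich-in-Ontology: turn valid dialogue acts into labelled utterances."""
    synthetic = []
    for act in ontology_acts:
        for utt in pretrained_lm_generate(act, n_per_act):
            synthetic.append((utt, act))  # utterance paired with its label
    return synthetic

data = augment([("inform", "food", "italian")])
```

The synthetic pairs can then be mixed into the SLU model's training set alongside the original labelled data.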
M2H-GAN: A GAN-based Mapping from Machine to Human Transcripts for Speech Understanding
Deep learning is at the core of recent spoken language understanding (SLU)
related tasks. More precisely, deep neural networks (DNNs) drastically
increased the performances of SLU systems, and numerous architectures have been
proposed. In the real-life context of theme identification of telephone
conversations, it is common to hold both a manual human (TRS) and an
automatically transcribed (ASR) version of the conversations. Nonetheless,
due to production constraints, only the ASR transcripts are considered to build
automatic classifiers. TRS transcripts are only used to measure the
performances of ASR systems. Moreover, the recent classification accuracies
obtained by DNN-related systems are close to the performances reached by
humans, and it becomes difficult to further increase
the performances by only considering the ASR transcripts. This paper proposes
to distil the TRS knowledge available during the training phase into the
ASR representation, by using a new generative adversarial network called
M2H-GAN to generate a TRS-like version of an ASR document, to improve the theme
identification performances.
Comment: Submitted at INTERSPEECH 201
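The adversarial objective behind the mapping above can be sketched with toy functions: a generator `G` pushes ASR-side representations toward the TRS space while a discriminator `D` tries to tell real TRS representations from generated ones. Both functions here are hypothetical toy stand-ins; the paper's M2H-GAN uses neural networks for both roles.

```python
import math

def G(asr_vec):
    # Hypothetical generator: maps an ASR-side representation toward the
    # TRS (manual transcript) space; here just a fixed toy shift.
    return [x + 0.5 for x in asr_vec]

def D(vec):
    # Hypothetical discriminator: probability that the input is a real TRS
    # representation; here a toy logistic score on the mean.
    m = sum(vec) / len(vec)
    return 1.0 / (1.0 + math.exp(-m))

def gan_losses(trs_vec, asr_vec):
    """Standard GAN losses: D is rewarded for separating real TRS vectors
    from generated ones; G is rewarded for fooling D."""
    fake = G(asr_vec)
    d_loss = -(math.log(D(trs_vec)) + math.log(1.0 - D(fake)))
    g_loss = -math.log(D(fake))
    return d_loss, g_loss

d_loss, g_loss = gan_losses([1.0, 1.0], [-1.0, -1.0])
```

At convergence the generated "TRS-like" representation of an ASR document can be fed to the theme classifier in place of the raw ASR representation.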
Deep Cascade Multi-task Learning for Slot Filling in Online Shopping Assistant
Slot filling is a critical task in natural language understanding (NLU) for
dialog systems. State-of-the-art approaches treat it as a sequence labeling
problem and adopt such models as BiLSTM-CRF. While these models work relatively
well on standard benchmark datasets, they face challenges in the context of
E-commerce where the slot labels are more informative and carry richer
expressions. In this work, inspired by the unique structure of E-commerce
knowledge base, we propose a novel multi-task model with cascade and residual
connections, which jointly learns segment tagging, named entity tagging and
slot filling. Experiments show the effectiveness of the proposed cascade and
residual structures. Our model has a 14.6% advantage in F1 score over the
strong baseline methods on a new Chinese E-commerce shopping assistant dataset,
while achieving competitive accuracies on a standard dataset. Furthermore,
an online test deployed on a dominant E-commerce platform shows a 130%
improvement in the accuracy of understanding user utterances. Our model has
already gone into production on the E-commerce platform.
Comment: AAAI 201
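The cascade's data flow can be illustrated with a toy rule-based version: each stage consumes the previous stage's tags plus the raw tokens (the residual connection). The rules and label names here are hypothetical stand-ins; the paper's stages are jointly trained neural taggers.

```python
def segment_tag(tokens):
    # Stage 1: coarse segment tagging (toy rule: mark known product words).
    return ["B-SEG" if t in {"red", "nike", "shoes"} else "O" for t in tokens]

def entity_tag(tokens, seg_tags):
    # Stage 2: named entity tagging; consumes stage-1 tags plus the raw
    # tokens (residual connection).
    ents = []
    for t, s in zip(tokens, seg_tags):
        if s == "O":
            ents.append("O")
        elif t == "nike":
            ents.append("B-BRAND")
        else:
            ents.append("B-PROD")
    return ents

def slot_fill(tokens, seg_tags, ent_tags):
    # Stage 3: slot filling over the accumulated earlier signals.
    mapping = {"B-BRAND": "brand", "B-PROD": "product_word"}
    return [mapping.get(e, "O") for e in ent_tags]

tokens = ["buy", "nike", "shoes"]
segs = segment_tag(tokens)
ents = entity_tag(tokens, segs)
slots = slot_fill(tokens, segs, ents)
```

The point of the cascade is that the easier coarse tasks constrain the harder fine-grained slot labels, while the residual inputs let later stages recover from earlier-stage mistakes.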
Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting
Speech recognition is a sequence prediction problem. Besides employing
various deep learning approaches for frame-level classification, sequence-level
discriminative training has proven indispensable for achieving
state-of-the-art performance in large vocabulary continuous speech recognition
(LVCSR). However, keyword spotting (KWS), as one of the most common speech
recognition tasks, almost only benefits from frame-level deep learning due to
the difficulty of getting competing sequence hypotheses. The few studies on
sequence discriminative training for KWS are limited to fixed-vocabulary or
LVCSR-based methods and have not been compared to the state-of-the-art deep
learning based KWS approaches. In this paper, a sequence discriminative
training framework is proposed for both fixed vocabulary and unrestricted
acoustic KWS. Sequence discriminative training for both sequence-level
generative and discriminative models is systematically investigated. By
introducing word-independent phone lattices or non-keyword blank symbols to
construct competing hypotheses, feasible and efficient sequence discriminative
training approaches are proposed for acoustic KWS. Experiments showed that the
proposed approaches obtained consistent and significant improvement in both
fixed vocabulary and unrestricted KWS tasks, compared to previous frame-level
deep learning based acoustic KWS methods.
Comment: accepted by Speech Communication, 08/02/201
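The core of sequence discriminative training is a criterion of the MMI family: maximise the reference sequence's share of the total probability mass over competing hypotheses (here, the phone-lattice or blank-symbol paths mentioned above). A minimal sketch over toy log-scores, all of which are hypothetical numbers:

```python
import math

def mmi_loss(ref_score, competing_scores):
    """MMI-style sequence-discriminative loss: negative log of the reference
    path's share of the total mass over reference + competing hypotheses."""
    all_scores = [ref_score] + competing_scores
    log_den = math.log(sum(math.exp(s) for s in all_scores))
    return -(ref_score - log_den)

# Keyword path vs. competing lattice hypotheses (toy log-scores).
loss_before = mmi_loss(ref_score=1.0, competing_scores=[0.8, 0.9])
loss_after = mmi_loss(ref_score=2.0, competing_scores=[0.8, 0.9])
```

Raising the keyword path's score relative to the competing hypotheses lowers the loss, which is exactly the discrimination pressure that frame-level objectives lack.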
Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management
Deep reinforcement learning (RL) methods have significant potential for
dialogue policy optimisation. However, they suffer from a poor performance in
the early stages of learning. This is especially problematic for on-line
learning with real users. Two approaches are introduced to tackle this problem.
Firstly, to speed up the learning process, two sample-efficient neural network
algorithms: trust region actor-critic with experience replay (TRACER) and
episodic natural actor-critic with experience replay (eNACER) are presented.
For TRACER, the trust region helps to control the learning step size and avoid
catastrophic model changes. For eNACER, the natural gradient identifies the
steepest ascent direction in policy space to speed up the convergence. Both
models employ off-policy learning with experience replay to improve
sample-efficiency. Secondly, to mitigate the cold start issue, a corpus of
demonstration data is utilised to pre-train the models prior to on-line
reinforcement learning. Combining these two approaches, we demonstrate a
practical approach to learn deep RL-based dialogue policies and demonstrate
their effectiveness in a task-oriented information seeking domain.
Comment: Accepted as a long paper in SigDial 201
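The experience-replay component shared by TRACER and eNACER can be sketched as a fixed-size transition buffer sampled uniformly for off-policy updates; the capacity and transition format below are hypothetical choices for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state) transitions;
    old transitions are evicted when capacity is exceeded, and minibatches
    are sampled uniformly for off-policy updates."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=3)
for t in range(5):                 # transitions 0 and 1 get evicted
    buf.add((f"s{t}", "act", 0.1 * t, f"s{t+1}"))
batch = buf.sample(2)
```

The same buffer can be pre-filled with demonstration-corpus transitions to address the cold-start problem before on-line reinforcement learning begins.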
Mitigating the Impact of Speech Recognition Errors on Chatbot using Sequence-to-Sequence Model
We apply sequence-to-sequence model to mitigate the impact of speech
recognition errors on open domain end-to-end dialog generation. We cast the
task as a domain adaptation problem where ASR transcriptions and original text
are in two different domains. In this paper, our proposed model includes an
individual encoder for each domain's data and makes their hidden states similar
so that the decoder predicts the same dialog text. The results show that the
sequence-to-sequence model can learn that an ASR transcription and its original
text carry the same meaning, mitigating the speech recognition errors.
Experimental results on the Cornell movie dialog dataset demonstrate that the
domain adaptation system helps the spoken dialog system generate responses more
similar to the original text answers.
Comment: Accepted at ASRU 201
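The dual-encoder idea above reduces to an auxiliary loss that pulls the two encoders' hidden states together. A minimal sketch, where averaging toy embeddings stands in for an RNN encoder and the embedding table is hypothetical:

```python
def encode(tokens, table):
    # Hypothetical encoder: the mean of per-token vectors stands in for an
    # RNN encoder's final hidden state.
    vecs = [table.get(t, [0.0, 0.0]) for t in tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

def similarity_loss(h_text, h_asr):
    """Mean squared distance between the two encoders' hidden states; training
    pushes this toward zero so the shared decoder treats ASR input like
    clean text."""
    return sum((a - b) ** 2 for a, b in zip(h_text, h_asr)) / len(h_text)

emb = {"hello": [1.0, 0.0], "hallo": [0.9, 0.1], "there": [0.0, 1.0]}
h_text = encode(["hello", "there"], emb)
h_asr = encode(["hallo", "there"], emb)   # ASR error: "hello" -> "hallo"
loss = similarity_loss(h_text, h_asr)
```

Minimising this term jointly with the decoder's generation loss is what lets one decoder serve both the clean-text and ASR domains.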
Neural Approaches to Conversational AI
The present paper surveys neural approaches to conversational AI that have
been developed in the last few years. We group conversational systems into
three categories: (1) question answering agents, (2) task-oriented dialogue
agents, and (3) chatbots. For each category, we present a review of
state-of-the-art neural approaches, draw the connection between them and
traditional approaches, and discuss the progress that has been made and
challenges still being faced, using specific systems and models as case
studies.
Comment: Foundations and Trends in Information Retrieval (95 pages)
A Scalable Neural Shortlisting-Reranking Approach for Large-Scale Domain Classification in Natural Language Understanding
Intelligent personal digital assistants (IPDAs), a popular real-life
application with spoken language understanding capabilities, can cover
potentially thousands of overlapping domains for natural language
understanding, and the task of finding the best domain to handle an utterance
becomes a challenging problem on a large scale. In this paper, we propose a set
of efficient and scalable neural shortlisting-reranking models for large-scale
domain classification in IPDAs. The shortlisting stage focuses on efficiently
trimming all domains down to a list of k-best candidate domains, and the
reranking stage performs a list-wise reranking of the initial k-best domains
with additional contextual information. We show the effectiveness of our
approach with extensive experiments on 1,500 IPDA domains.
Comment: Accepted to NAACL 201
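The two-stage pipeline above can be sketched with toy scorers: a cheap first-pass score trims the full domain inventory to a k-best list, and a second pass reranks that list using extra context. The domain names, scores, and context bonuses below are hypothetical; the paper's scorers are neural models.

```python
def shortlist(domain_scores, k):
    """Stage 1: a cheap scorer trims thousands of domains to the k-best."""
    ranked = sorted(domain_scores, key=domain_scores.get, reverse=True)
    return ranked[:k]

def rerank(candidates, context_bonus):
    """Stage 2: list-wise rerank of the k-best using additional contextual
    information (hypothetical per-domain context bonuses here)."""
    return max(candidates, key=lambda d: context_bonus.get(d, 0.0))

scores = {"music": 0.9, "weather": 0.8, "timer": 0.7, "recipes": 0.1}
top_k = shortlist(scores, k=3)
best = rerank(top_k, {"music": 1.0, "weather": 0.2})
```

The split keeps the expensive, context-aware model off the critical path for all but k candidates, which is what makes the approach scale to thousands of domains.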
Speech-Based Visual Question Answering
This paper introduces speech-based visual question answering (VQA), the task
of generating an answer given an image and a spoken question. Two methods are
studied: an end-to-end, deep neural network that directly uses audio waveforms
as input versus a pipelined approach that performs ASR (Automatic Speech
Recognition) on the question, followed by text-based visual question answering.
Furthermore, we investigate the robustness of both methods by injecting various
levels of noise into the spoken question, and find that both methods tolerate
noise at similar levels.
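The noise-injection protocol above amounts to scaling a noise signal so the mixture hits a target signal-to-noise ratio before it is fed to either VQA pipeline. A minimal sketch, with a deterministic toy waveform and "noise" standing in for real audio:

```python
import math

def mix_at_snr(signal, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR in decibels."""
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]

signal = [math.sin(0.1 * t) for t in range(1000)]
noise = [((t * 7919) % 13 - 6) / 6 for t in range(1000)]  # deterministic toy noise
mixed = mix_at_snr(signal, noise, snr_db=10.0)
```

Sweeping `snr_db` downward then traces out each method's degradation curve, which is how the robustness comparison is made.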
Efficient Large-Scale Domain Classification with Personalized Attention
In this paper, we explore the task of mapping spoken language utterances to
one of thousands of natural language understanding domains in intelligent
personal digital assistants (IPDAs). This scenario is observed for many
mainstream IPDAs in industry that allow third parties to develop thousands of
new domains to augment built-in ones to rapidly increase domain coverage and
overall IPDA capabilities. We propose a scalable neural model architecture with
a shared encoder, a novel attention mechanism that incorporates personalization
information, and domain-specific classifiers, which together solve the problem
efficiently. Our architecture is designed to efficiently accommodate new
domains that appear in-between full model retraining cycles with a rapid
bootstrapping mechanism two orders of magnitude faster than retraining. We
account for practical constraints in real-time production systems, and design
to minimize memory footprint and runtime latency. We demonstrate that
incorporating personalization results in significantly more accurate domain
classification in the setting with thousands of overlapping domains.
Comment: Accepted to ACL 201
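One way to read the personalized attention idea is as a bias on the domain scores before normalisation: domains the user has enabled get a boost before the softmax over candidates. The bonus value, logits, and domain names below are hypothetical illustrations, not the paper's parameterisation.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def personalized_attention(domain_logits, enabled, bonus=2.0):
    """Add a personalization bias (hypothetical: a fixed bonus for domains
    the user has enabled) before the softmax over domain candidates."""
    names = list(domain_logits)
    biased = [domain_logits[d] + (bonus if d in enabled else 0.0)
              for d in names]
    return dict(zip(names, softmax(biased)))

logits = {"music": 1.0, "smart_home": 1.0, "trivia": 0.5}
probs = personalized_attention(logits, enabled={"smart_home"})
```

Breaking the tie between equally plausible domains with the user's own enablement history is precisely where personalization pays off at this scale.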