Towards end-to-end spoken language understanding
A spoken language understanding system is traditionally designed as a pipeline
of several components. First, the audio signal is processed by an automatic
speech recognizer to produce a transcription or n-best hypotheses. From the
recognition results, a natural language understanding system maps the text to
structured data such as domain, intent, and slots for downstream consumers,
such as dialog systems and hands-free applications. These components are
usually developed and optimized independently. In this paper, we present our
study of an end-to-end learning system for spoken language understanding. With
this unified approach, we can infer the semantic meaning directly from audio
features without an intermediate text representation. This study shows that
the trained model can achieve reasonably good results and demonstrates that it
can capture semantic attention directly from the audio features. Comment: submitted to ICASSP 201
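As a rough illustration of the kind of unified model described above, the sketch below maps a sequence of audio features directly to an intent label with an attention-pooled recurrent encoder and no intermediate transcript. It is a minimal PyTorch sketch under assumed dimensions (80-dim log-mel frames, 10 intent classes); the paper's actual architecture and label inventory are not specified here.

```python
import torch
import torch.nn as nn

class EndToEndSLU(nn.Module):
    """Minimal sketch: audio features -> intent, no intermediate transcript."""

    def __init__(self, n_mels=80, hidden=256, n_intents=10):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # frame-level attention scores
        self.classifier = nn.Linear(2 * hidden, n_intents)

    def forward(self, feats):                       # feats: (batch, frames, n_mels)
        enc, _ = self.encoder(feats)                # (batch, frames, 2*hidden)
        weights = torch.softmax(self.attn(enc), dim=1)
        pooled = (weights * enc).sum(dim=1)         # attention pooling over frames
        return self.classifier(pooled)              # intent logits

# Usage: logits = EndToEndSLU()(torch.randn(4, 300, 80))
```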
End-to-end architectures for ASR-free spoken language understanding
Spoken Language Understanding (SLU) is the problem of extracting the meaning
from speech utterances. It is typically addressed as a two-step problem, where
an Automatic Speech Recognition (ASR) model is employed to convert speech into
text, followed by a Natural Language Understanding (NLU) model to extract
meaning from the decoded text. Recently, end-to-end approaches have emerged,
aiming to unify ASR and NLU into a single SLU deep neural architecture trained
using combinations of ASR- and NLU-level recognition units. In this
paper, we explore a set of recurrent architectures for intent classification,
tailored to the recently introduced Fluent Speech Commands (FSC) dataset, where
intents are formed as combinations of three slots (action, object, and
location). We show that by combining deep recurrent architectures with standard
data augmentation, state-of-the-art results can be attained, without using
ASR-level targets or pretrained ASR models. We also investigate the model's
generalizability to new wordings and show that it performs reasonably well on
wordings unseen during training. Comment: Accepted at ICASSP-202
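The abstract does not give the exact recurrent architecture, so the following is only a plausible sketch of an FSC-style classifier: a bidirectional GRU over audio features with one softmax head per slot (action, object, location), whose joint prediction forms the intent. The slot cardinalities and feature dimension are illustrative assumptions, and data augmentation (e.g. time masking) is omitted.

```python
import torch
import torch.nn as nn

class FSCIntentClassifier(nn.Module):
    """Sketch: recurrent encoder with one head per FSC slot (action/object/location)."""

    def __init__(self, n_mels=40, hidden=128,
                 n_actions=6, n_objects=14, n_locations=4):   # illustrative sizes
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.heads = nn.ModuleDict({
            "action":   nn.Linear(2 * hidden, n_actions),
            "object":   nn.Linear(2 * hidden, n_objects),
            "location": nn.Linear(2 * hidden, n_locations),
        })

    def forward(self, feats):                      # (batch, frames, n_mels)
        enc, _ = self.encoder(feats)
        pooled = enc.mean(dim=1)                   # simple mean pooling over time
        return {name: head(pooled) for name, head in self.heads.items()}
```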
Capsule Networks for Low Resource Spoken Language Understanding
Designing a spoken language understanding system for command-and-control
applications can be challenging because of a wide variety of domains and users
or because of a lack of training data. In this paper we discuss a system that
learns from scratch from user demonstrations. This method has the advantage
that the same system can be used for many domains and users without
modifications and that no training data is required prior to deployment. The
user is required to train the system, so for a user-friendly experience it is
crucial to minimize the required amount of data. In this paper we investigate
whether a capsule network can make efficient use of the limited amount of
available training data. We compare the proposed model to an approach based on
Non-negative Matrix Factorisation, which is the state of the art in this setting,
and another deep learning approach that was recently introduced for end-to-end
spoken language understanding. We show that the proposed model outperforms the
baseline models for three command-and-control applications: controlling a small
robot, a vocally guided card game, and a home automation task. Comment: Submitted to INTERSPEECH 201
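For readers unfamiliar with capsule networks, the sketch below shows the two ingredients at their core, the squashing non-linearity and routing-by-agreement, in generic PyTorch form. It illustrates the mechanism only, not this paper's specific model; the shapes and the number of routing iterations are assumptions.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Squash non-linearity: keeps the direction, maps the norm into [0, 1).
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iter=3):
    # u_hat: (batch, n_in, n_out, d_out) prediction vectors from input capsules.
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits
    for _ in range(n_iter):
        c = F.softmax(b, dim=2)                       # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum over inputs
        v = squash(s)                                 # output capsules (batch, n_out, d_out)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # agreement update
    return v
```

In an SLU setting, the norm of each output capsule can be read as the probability that the corresponding slot value occurs in the utterance.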
End-to-end named entity extraction from speech
Named entity recognition (NER) is among the SLU tasks that usually extract
semantic information from textual documents. Until now, NER from speech has
been performed through a pipeline process that first applies automatic speech
recognition (ASR) to the audio and then applies NER to the ASR outputs. Such an
approach has several disadvantages (error propagation, a metric for tuning ASR
systems that is sub-optimal with respect to the final task, a reduced search
space at the ASR output level, ...), and it is known that more integrated
approaches outperform sequential ones when they can be applied. In this paper,
we present a first study of an end-to-end approach that directly extracts named
entities from speech through a single neural architecture. In this way, joint
optimization of both ASR and NER becomes possible. Experiments are carried out
on easily accessible French data drawn from several evaluation campaigns.
Experimental results show that this end-to-end approach provides better results
(F-measure=0.69 on test data) than a classical pipeline approach for detecting
named entity categories (F-measure=0.65). Comment: Submitted to Interspeech 201
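One common way to realize such a single architecture (a plausible reading of the abstract, not necessarily the authors' exact scheme) is to extend the character output vocabulary with entity-tag symbols, so that one CTC-trained model emits the transcript and the entity annotation together. The tag set, dimensions, and character inventory below are assumptions.

```python
import torch
import torch.nn as nn

# Sketch: treat entity boundaries as extra output symbols so a single
# speech model emits both the transcript and the entity annotation.
# Example target: "appelez <pers> jean dupont </> à <loc> paris </>"
chars = list("abcdefghijklmnopqrstuvwxyz 'àâéèêëîïôùûç")
tags = ["<pers>", "<loc>", "<org>", "</>"]               # assumed entity categories
vocab = {sym: i + 1 for i, sym in enumerate(chars + tags)}  # index 0 = CTC blank

class SpeechNER(nn.Module):
    def __init__(self, n_mels=40, hidden=320, n_out=len(vocab) + 1):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=4,
                               batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_out)

    def forward(self, feats):                        # (batch, frames, n_mels)
        enc, _ = self.encoder(feats)
        return self.proj(enc).log_softmax(dim=-1)    # per-frame symbol log-probs

ctc = nn.CTCLoss(blank=0)   # one loss jointly optimizes transcription and tagging
```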
Speech Model Pre-training for End-to-End Spoken Language Understanding
Whereas conventional spoken language understanding (SLU) systems map speech
to text, and then text to intent, end-to-end SLU systems map speech directly to
intent through a single trainable model. Achieving high accuracy with these
end-to-end models without a large amount of training data is difficult. We
propose a method to reduce the data requirements of end-to-end SLU in which the
model is first pre-trained to predict words and phonemes, thus learning good
features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and
show that our method improves performance both when the full dataset is used
for training and when only a small subset is used. We also describe preliminary
experiments to gauge the model's ability to generalize to new phrases not heard
during training. Comment: Accepted to Interspeech 201
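The two-stage recipe described above can be sketched as follows: an acoustic encoder is first trained with word- and phoneme-level targets, then reused with a small intent head for SLU fine-tuning. Layer sizes, the vocabulary sizes, and the pooling choice are assumptions; only the overall pre-train-then-fine-tune structure follows the abstract.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Shared acoustic encoder, first trained on word/phoneme targets."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True)

    def forward(self, feats):                  # (batch, frames, n_mels)
        out, _ = self.rnn(feats)
        return out                             # (batch, frames, hidden)

# Stage 1 (pre-training): frame-level phoneme/word prediction heads.
encoder = SpeechEncoder()
phoneme_head = nn.Linear(256, 42)      # assumed phoneme inventory size
word_head = nn.Linear(256, 10000)      # assumed word vocabulary size
# ... train encoder + heads on ASR-style word/phoneme targets ...

# Stage 2 (SLU fine-tuning): reuse the pretrained encoder, add an intent head.
intent_head = nn.Linear(256, 31)       # intent inventory (FSC defines 31 intents)
def predict_intent(feats):
    pooled = encoder(feats).mean(dim=1)    # pool pretrained features over time
    return intent_head(pooled)
```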
M2H-GAN: A GAN-based Mapping from Machine to Human Transcripts for Speech Understanding
Deep learning is at the core of recent spoken language understanding (SLU)
related tasks. More precisely, deep neural networks (DNNs) have drastically
increased the performance of SLU systems, and numerous architectures have been
proposed. In the real-life context of theme identification of telephone
conversations, it is common to have both a manual, human-produced (TRS) version
and an automatically transcribed (ASR) version of the conversations.
Nonetheless, due to production constraints, only the ASR transcripts are
considered for building automatic classifiers; TRS transcripts are only used to
measure the performance of ASR systems. Moreover, the classification accuracy
recently obtained by DNN-based systems is close to human performance, and it is
becoming difficult to improve further by considering only the ASR transcripts.
This paper proposes to distill the TRS knowledge available during the training
phase into the ASR representation, using a new generative adversarial network
called M2H-GAN to generate a TRS-like version of an ASR document and thus
improve theme identification performance. Comment: Submitted at INTERSPEECH 201
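The abstract does not detail the M2H-GAN objective, so the sketch below shows only a generic adversarial mapping of that kind: a generator pushes ASR document embeddings towards the TRS embedding space while a discriminator tries to tell real TRS embeddings from generated ones. The embedding size, optimizers, and the absence of any task-specific loss term are assumptions.

```python
import torch
import torch.nn as nn

dim = 512                               # assumed document embedding size
G = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
D = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(asr_emb, trs_emb):
    # Discriminator: real TRS embeddings vs. generated TRS-like embeddings.
    fake = G(asr_emb).detach()
    d_loss = bce(D(trs_emb), torch.ones(len(trs_emb), 1)) + \
             bce(D(fake), torch.zeros(len(fake), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator (a theme-classification or
    # reconstruction term could be added to keep the mapping useful).
    g_loss = bce(D(G(asr_emb)), torch.ones(len(asr_emb), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```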
Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems
In this paper, we present a series of complementary approaches to improve the
recognition of underrepresented named entities (NE) in hybrid ASR systems
without compromising overall word error rate performance. The underrepresented
words correspond to rare or out-of-vocabulary (OOV) words in the training data
and therefore cannot be modeled reliably. We begin with a graphemic lexicon,
which removes the need for phonetic models in hybrid ASR. We study it under
different settings and demonstrate its effectiveness in dealing with
underrepresented NEs. Next, we study the impact of a neural language model (LM)
with letter-based features designed to handle infrequent words. After that, we
enrich the representations of underrepresented NEs in a pretrained neural LM by
borrowing the embedding representations of well-represented words. This yields
a significant performance improvement in underrepresented NE recognition.
Finally, we boost the likelihood scores of utterances containing NEs in the
word lattices rescored by neural LMs and gain a further performance
improvement. The combination of the aforementioned approaches improves NE
recognition by up to 42% relative.
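The embedding-borrowing step can be pictured with the hedged sketch below: rows of the pretrained LM's embedding matrix that correspond to rare NEs are overwritten with the embeddings of well-represented donor words. How donors are chosen in the paper is not stated here, so the sketch simply assumes a user-supplied rare-to-donor mapping.

```python
import torch

def borrow_embeddings(embedding, vocab, donor_map):
    """Sketch: overwrite rare-NE rows of an LM embedding matrix with the
    embeddings of well-represented donor words.

    embedding: torch.nn.Embedding of the pretrained neural LM
    vocab:     dict mapping word -> row index
    donor_map: dict mapping rare NE -> donor word (assumed to be given,
               e.g. a frequent word of similar category or spelling)
    """
    with torch.no_grad():
        for rare, donor in donor_map.items():
            if rare in vocab and donor in vocab:
                embedding.weight[vocab[rare]] = embedding.weight[vocab[donor]]

# Usage (hypothetical entries):
# borrow_embeddings(lm.embedding, vocab, {"Zelenskyy": "president"})
```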
SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering
While various end-to-end models for spoken language understanding tasks have
been explored recently, this paper is probably the first known attempt to
challenge the very difficult task of end-to-end spoken question answering
(SQA). Learning from the very successful BERT model for various text processing
tasks, we propose an audio-and-text jointly learned SpeechBERT model. This
model outperforms the conventional approach of cascading ASR with a downstream
text question answering (TQA) model on datasets whose answer spans include ASR
errors, because the end-to-end model is able to extract information from the
audio data before ASR introduces errors. Ensembling the proposed end-to-end
model with the cascade architecture achieves even better performance. Beyond
the potential of end-to-end SQA, SpeechBERT can also be considered for many
other spoken language understanding tasks, just as BERT is for many text
processing tasks. Comment: Interspeech 202
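The abstract does not specify how the end-to-end and cascade systems are ensembled; one simple reading, sketched below, is to interpolate their answer-span start and end scores before picking the best span. The interpolation weight and the assumption that both systems score the same position grid are illustrative.

```python
import torch

def ensemble_spans(e2e_start, e2e_end, cascade_start, cascade_end, alpha=0.5):
    """Sketch: combine 1-D span start/end score vectors from the end-to-end
    model and the ASR+TQA cascade (assumed to cover the same positions;
    in practice an alignment step, not shown, would be needed)."""
    start = alpha * e2e_start + (1 - alpha) * cascade_start
    end = alpha * e2e_end + (1 - alpha) * cascade_end
    scores = start[:, None] + end[None, :]             # score of every (start, end) pair
    mask = torch.ones_like(scores).tril(-1).bool()     # pairs where end < start
    scores = scores.masked_fill(mask, float("-inf"))   # forbid invalid spans
    idx = torch.argmax(scores)
    return divmod(idx.item(), scores.size(1))          # (start_index, end_index)
```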
From Audio to Semantics: Approaches to end-to-end spoken language understanding
Conventional spoken language understanding systems consist of two main
components: an automatic speech recognition module that converts audio to a
transcript, and a natural language understanding module that transforms the
resulting text (or top N hypotheses) into a set of domains, intents, and
arguments. These modules are typically optimized independently. In this paper,
we formulate audio to semantic understanding as a sequence-to-sequence problem
[1]. We propose and compare various encoder-decoder based approaches that
optimize both modules jointly, in an end-to-end manner. Evaluations on a
real-world task show that 1) having an intermediate text representation is
crucial for the quality of the predicted semantics, especially the intent
arguments, and 2) jointly optimizing the full system improves the overall
prediction accuracy. Compared to independently trained models, our best jointly
trained model achieves similar domain and intent prediction F1 scores, but
improves the argument word error rate by 18% relative.
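To make the sequence-to-sequence formulation concrete, the sketch below shows one way a semantic frame can be serialized into a target token sequence, optionally preceded by the transcript (matching the finding that an intermediate text representation helps). The serialization format is an assumption, not the paper's.

```python
def serialize_target(transcript, domain, intent, args, with_text=True):
    """Sketch: flatten (domain, intent, arguments) into one token sequence
    that a standard sequence-to-sequence decoder can be trained to emit."""
    tokens = []
    if with_text:
        tokens += transcript.split() + ["<sep>"]   # transcript first, then semantics
    tokens += [f"<domain:{domain}>", f"<intent:{intent}>"]
    for slot, value in args.items():
        tokens += [f"<{slot}>"] + value.split() + [f"</{slot}>"]
    return tokens

# Example:
# serialize_target("set an alarm for 7 am", "alarms", "set_alarm", {"time": "7 am"})
# -> ['set', 'an', 'alarm', 'for', '7', 'am', '<sep>', '<domain:alarms>',
#     '<intent:set_alarm>', '<time>', '7', 'am', '</time>']
```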
Improving the Robustness of Speech Translation
Although neural machine translation (NMT) has made impressive progress
recently, it is usually trained on clean parallel data and hence does not work
well when the input sentence is the output of an automatic speech recognition
(ASR) system, because of the numerous errors in the source. To solve this
problem, we propose a simple but effective method to improve the robustness of
NMT in the speech translation setting. We simulate the noise present in
realistic ASR output and inject it into the clean parallel data so that the NMT
model sees similar word distributions during training and testing. In addition,
we incorporate a Chinese Pinyin feature, which is easy to obtain in speech
translation, to further improve translation performance. Experimental results
show that our method achieves more stable performance and outperforms the
baseline by an average of 3.12 BLEU on multiple noisy test sets, while also
achieving a generalization improvement on the WMT'17 Chinese-English test set.
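The noise simulation procedure is not spelled out in the abstract; the hedged sketch below mimics common ASR error types (substitutions with homophones from an assumed Pinyin-keyed table, deletions, and insertions) on the clean source side. A token-level Pinyin feature could be attached from the same lexicon, which is not shown.

```python
import random

def inject_asr_noise(tokens, homophones, p_sub=0.1, p_del=0.03, p_ins=0.03):
    """Sketch: simulate ASR-style errors on a clean source sentence.

    tokens:     list of source words/characters
    homophones: dict word -> list of same-sounding candidates (assumed to be
                built offline, e.g. from a Pinyin-keyed lexicon)
    """
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < p_del:                                  # deletion error
            continue
        if r < p_del + p_sub and homophones.get(tok):  # homophone substitution
            noisy.append(random.choice(homophones[tok]))
        else:
            noisy.append(tok)
        if random.random() < p_ins:                    # insertion (repeat) error
            noisy.append(tok)
    return noisy

# Usage (hypothetical homophone entries):
# inject_asr_noise(list("他在北京"), {"他": ["她", "它"]})
```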