Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs
A major focus of recent research in spoken language understanding (SLU) has
been on the end-to-end approach where a single model can predict intents
directly from speech inputs without intermediate transcripts. However, this
approach presents some challenges. First, since speech can be considered
personally identifiable information, in some cases only automatic speech
recognition (ASR) transcripts are accessible. Second, intent-labeled speech
data is scarce. To address the first challenge, we propose a novel system that
can predict intents from flexible types of inputs: speech, ASR transcripts, or
both. We demonstrate strong performance for either modality separately, and
when both speech and ASR transcripts are available, through system combination,
we achieve better results than using a single input modality. To address the
second challenge, we leverage a semantically robust pre-trained BERT model and
adopt a cross-modal system that co-trains text embeddings and acoustic
embeddings in a shared latent space. We further enhance this system by
utilizing an acoustic module pre-trained on LibriSpeech and domain-adapting the
text module on our target datasets. Our experiments show significant advantages
for these pre-training and fine-tuning strategies, resulting in a system that
achieves competitive intent-classification performance on Snips SLU and Fluent
Speech Commands datasets.
Comment: Accepted to Interspeech 202
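As a rough illustration of the cross-modal idea described in this abstract, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: a text encoder and an acoustic encoder (simple stand-ins here for a pre-trained BERT text module and a LibriSpeech-pre-trained acoustic module) are projected into a shared latent space read by a single intent classifier, and an alignment term (an assumption) co-trains the two embeddings so that either modality alone, or both combined, can drive the classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalIntentModel(nn.Module):
    def __init__(self, vocab_size=1000, n_mels=80, shared_dim=256, n_intents=31):
        super().__init__()
        # Stand-in for a pre-trained BERT text encoder.
        self.text_embed = nn.Embedding(vocab_size, 256)
        self.text_rnn = nn.GRU(256, 256, batch_first=True)
        # Stand-in for an acoustic encoder pre-trained on LibriSpeech.
        self.acoustic_rnn = nn.GRU(n_mels, 256, batch_first=True)
        # Projections of each modality into the shared latent space.
        self.text_proj = nn.Linear(256, shared_dim)
        self.acoustic_proj = nn.Linear(256, shared_dim)
        # A single intent classifier shared by both modalities.
        self.classifier = nn.Linear(shared_dim, n_intents)

    def encode_text(self, token_ids):
        _, h = self.text_rnn(self.text_embed(token_ids))  # h: (1, B, 256)
        return self.text_proj(h.squeeze(0))               # (B, shared_dim)

    def encode_speech(self, features):
        _, h = self.acoustic_rnn(features)                # features: (B, T, n_mels)
        return self.acoustic_proj(h.squeeze(0))

    def forward(self, token_ids=None, features=None):
        # Flexible inputs: ASR transcript, speech, or both.
        zs = []
        if token_ids is not None:
            zs.append(self.encode_text(token_ids))
        if features is not None:
            zs.append(self.encode_speech(features))
        z = torch.stack(zs).mean(dim=0)                   # naive combination when both are given
        return self.classifier(z), zs

def co_training_loss(model, token_ids, features, intents):
    logits, (z_text, z_speech) = model(token_ids, features)
    ce = F.cross_entropy(logits, intents)
    # Alignment term (an assumption): pull paired text/acoustic embeddings together
    # so that either modality alone can feed the shared classifier.
    align = 1.0 - F.cosine_similarity(z_text, z_speech).mean()
    return ce + align

# Toy forward/backward pass showing the shapes involved.
model = CrossModalIntentModel()
tokens = torch.randint(0, 1000, (4, 12))  # ASR transcripts as token ids
speech = torch.randn(4, 200, 80)          # log-mel feature sequences
intents = torch.randint(0, 31, (4,))
loss = co_training_loss(model, tokens, speech, intents)
loss.backward()
```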
End-to-End Spoken Language Understanding Without Full Transcripts
An essential component of spoken language understanding (SLU) is slot
filling: representing the meaning of a spoken utterance using semantic entity
labels. In this paper, we develop end-to-end (E2E) spoken language
understanding systems that directly convert speech input to semantic entities
and investigate if these E2E SLU models can be trained solely on semantic
entity annotations without word-for-word transcripts. Training such models is
very useful as they can drastically reduce the cost of data collection. We
created two types of such speech-to-entities models, a CTC model and an
attention-based encoder-decoder model, by adapting models trained originally
for speech recognition. Given that our experiments involve speech input, these
systems need to recognize both the entity label and the words representing the
entity value correctly. For our speech-to-entities experiments on the ATIS
corpus, both the CTC and attention models showed impressive ability to skip
non-entity words: there was little degradation when trained on just entities
versus full transcripts. We also explored the scenario where the entities are
in an order not necessarily related to the spoken order in the utterance. With its
ability to do re-ordering, the attention model did remarkably well, achieving
only about 2% degradation in speech-to-bag-of-entities F1 score.
Comment: 5 pages, to be published in Interspeech 202
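For concreteness, here is a minimal sketch of how a "bag-of-entities" F1 can be scored, under the assumption that hypotheses and references are compared as unordered multisets of (label, value) pairs; the slot names below are illustrative ATIS-style examples, not values taken from the paper, and this is not the authors' exact scorer.

```python
from collections import Counter

def bag_of_entities_f1(predicted, reference):
    """predicted / reference: lists of (label, value) pairs for one utterance."""
    pred_bag, ref_bag = Counter(predicted), Counter(reference)
    tp = sum((pred_bag & ref_bag).values())        # multiset intersection
    precision = tp / max(sum(pred_bag.values()), 1)
    recall = tp / max(sum(ref_bag.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: entities hypothesized from speech vs. the reference annotation;
# order does not matter, so a re-ordered hypothesis scores the same.
hyp = [("fromloc.city_name", "boston"), ("toloc.city_name", "denver")]
ref = [("toloc.city_name", "denver"), ("fromloc.city_name", "boston"),
       ("depart_date.day_name", "monday")]
print(round(bag_of_entities_f1(hyp, ref), 3))      # 0.8
```

A corpus-level score would aggregate the true-positive and total counts over all utterances before computing precision and recall, rather than averaging per-utterance F1.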