Data-Efficiency with a Single GPU: An Exploration of Transfer Methods for Small Language Models
Multi-task learning (MTL), instruction tuning, and prompting have recently
been shown to improve the generalizability of large language models to new
tasks. However, the benefits of such methods are less well-documented in
smaller language models, with some studies finding contradictory results. In
this work, we explore and isolate the effects of (i) model size, (ii) general
purpose MTL, (iii) in-domain MTL, (iv) instruction tuning, and (v) few-shot
fine-tuning for models with fewer than 500 million parameters. Our experiments
in the zero-shot setting demonstrate that models gain 31% relative improvement,
on average, from general purpose MTL, with an additional 37.6% relative gain
from in-domain MTL. Contrary to prior work on large models, we find that
instruction tuning provides only a modest 2% performance improvement for small
models.
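The paper's headline numbers are average relative gains across tasks. As a minimal sketch of how such a comparison is computed (the scores below are made up for illustration, not taken from the paper):

```python
# Sketch: average relative improvement across tasks, the metric used to
# compare transfer methods. All scores here are hypothetical placeholders.

def avg_relative_improvement(baseline, treated):
    """Mean of (treated - baseline) / baseline over a list of task scores."""
    gains = [(t - b) / b for b, t in zip(baseline, treated)]
    return sum(gains) / len(gains)

baseline = [0.40, 0.50, 0.25]   # zero-shot scores without MTL (made up)
with_mtl = [0.52, 0.66, 0.33]   # after general-purpose MTL (made up)
print(round(avg_relative_improvement(baseline, with_mtl), 3))
```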
Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding
End-to-end (E2E) spoken language understanding (SLU) systems that generate a
semantic parse directly from speech have recently shown promise. This approach
uses a single model that combines audio and text representations from
pre-trained automatic speech recognition (ASR) models, and it outperforms
traditional pipeline SLU systems in on-device streaming scenarios. However, E2E SLU systems
still show weakness when text representation quality is low due to ASR
transcription errors. To overcome this issue, we propose a novel E2E SLU system
that enhances robustness to ASR errors by fusing audio and text representations
based on the estimated modality confidence of ASR hypotheses. We introduce two
novel techniques: 1) an effective method to encode the quality of ASR
hypotheses and 2) an effective approach to integrate them into E2E SLU models.
We show accuracy improvements on the STOP dataset and present an analysis
demonstrating the effectiveness of our approach.
Comment: INTERSPEECH 202
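The core idea, fusing audio and text features weighted by an estimated ASR confidence, can be sketched as a simple convex combination. This is a hypothetical illustration of the fusion step only; the actual confidence estimator and integration in the paper are learned components:

```python
import numpy as np

def confidence_fusion(audio_repr, text_repr, asr_confidence):
    """Weight text features by estimated ASR confidence, falling back
    toward audio features when the hypothesis is unreliable.
    A toy sketch: the real system learns both the confidence and the fusion."""
    c = float(np.clip(asr_confidence, 0.0, 1.0))
    return c * text_repr + (1.0 - c) * audio_repr

audio = np.array([0.2, 0.8])
text = np.array([1.0, 0.0])
print(confidence_fusion(audio, text, 0.75))  # high confidence: leans toward text
```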
Deliberation Model for On-Device Spoken Language Understanding
We propose a novel deliberation-based approach to end-to-end (E2E) spoken
language understanding (SLU), where a streaming automatic speech recognition
(ASR) model produces the first-pass hypothesis and a second-pass natural
language understanding (NLU) component generates the semantic parse by
conditioning on both ASR's text and audio embeddings. By formulating E2E SLU as
a generalized decoder, our system is able to support complex compositional
semantic structures. Furthermore, the sharing of parameters between ASR and NLU
makes the system especially suitable for resource-constrained (on-device)
environments; our proposed approach consistently outperforms strong pipeline
NLU baselines by 0.82% to 1.34% across various operating points on the spoken
version of the TOPv2 dataset. We demonstrate that the fusion of text and audio
features, coupled with the system's ability to rewrite the first-pass
hypothesis, makes our approach more robust to ASR errors. Finally, we show that
our approach can significantly reduce the degradation when moving from natural
speech to synthetic speech training, but more work is required to make
text-to-speech (TTS) a viable solution for scaling up E2E SLU.
Comment: Submitted to INTERSPEECH 202
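The two-pass data flow described above can be sketched with stubbed components. Everything here is a toy stand-in to show the wiring (first-pass hypothesis, shared embeddings, second-pass parse), not the paper's actual models:

```python
# Hypothetical sketch of the deliberation flow: a streaming ASR produces a
# first-pass hypothesis, and a second-pass NLU component generates the
# semantic parse conditioned on both text and audio embeddings.

def deliberation_parse(audio, asr, embed, parser):
    hypothesis = asr(audio)                     # first pass: ASR hypothesis text
    feats = embed(hypothesis) + embed(audio)    # fuse text and audio embeddings
    return parser(feats)                        # second pass: decode semantic parse

# Toy stand-ins illustrating data flow only:
asr = lambda a: "play music"
embed = lambda x: [float(len(str(x)))]          # one "embedding" per input
parser = lambda f: f"[IN:PLAY_MUSIC dims={len(f)}]"

print(deliberation_parse("<8kHz waveform>", asr, embed, parser))
```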
Augmenting text for spoken language understanding with Large Language Models
Spoken semantic parsing (SSP) involves generating machine-comprehensible
parses from input speech. Training robust models for existing application
domains represented in training data or extending to new domains requires
corresponding triplets of speech-transcript-semantic parse data, which is
expensive to obtain. In this paper, we address this challenge by examining
methods that can use transcript-semantic parse data (unpaired text) without
corresponding speech. First, when unpaired text is drawn from existing textual
corpora, Joint Audio Text (JAT) and Text-to-Speech (TTS) are compared as ways
to generate speech representations for unpaired text. Experiments on the STOP
dataset show that unpaired text from existing and new domains improves
performance by 2% and 30% in absolute Exact Match (EM) respectively. Second, we
consider the setting when unpaired text is not available in existing textual
corpora. We propose to prompt Large Language Models (LLMs) to generate unpaired
text for existing and new domains. Experiments show that examples and words
that co-occur with intents can be used to generate unpaired text with Llama
2.0. Using the generated text with JAT and TTS for spoken semantic parsing
improves EM on STOP by 1.4% and 2.6% absolute for existing and new domains
respectively.
Comment: Submitted to ICASSP 202
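The prompting setup described above — seeding the LLM with a few examples plus words that co-occur with the target intent — might look roughly like the following. The intent name, word list, and prompt wording are illustrative assumptions, not the paper's actual prompts:

```python
# Hypothetical sketch: build an LLM prompt from seed examples and words
# that co-occur with an intent, to elicit unpaired text for a new domain.

def build_prompt(intent, cooccurring_words, seed_examples, n=5):
    lines = [f"Generate {n} user requests for the intent {intent}."]
    lines.append("Use words such as: " + ", ".join(cooccurring_words))
    lines.append("Examples:")
    lines += [f"- {ex}" for ex in seed_examples]
    return "\n".join(lines)

prompt = build_prompt(
    "IN:GET_WEATHER",
    ["forecast", "rain", "tomorrow"],
    ["what's the weather tomorrow", "will it rain in Seattle"],
)
print(prompt)
```

The generated utterances would then be paired with JAT or TTS speech representations for training, as in the experiments above.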
STOP: A dataset for Spoken Task Oriented Semantic Parsing
End-to-end spoken language understanding (SLU) predicts intent directly from
audio using a single model. It promises to improve the performance of assistant
systems by leveraging acoustic information lost in the intermediate textual
representation and preventing cascading errors from Automatic Speech
Recognition (ASR). Further, having one unified model has efficiency advantages
when deploying assistant systems on-device. However, the limited number of
public audio datasets with semantic parse labels hinders the research progress
in this area. In this paper, we release the Spoken Task-Oriented semantic
Parsing (STOP) dataset, the largest and most complex publicly available SLU
dataset. Additionally, we define low-resource splits to establish a benchmark
for improving SLU when limited labeled data is available. Furthermore, in
addition to the human-recorded audio, we are releasing a TTS-generated version
to benchmark the performance for low-resource domain adaptation of end-to-end
SLU systems. Initial experiments show end-to-end SLU models performing
slightly worse than their cascaded counterparts; we hope this encourages
future work in this direction.