163,783 research outputs found
Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis
Training a high performance end-to-end speech (E2E) processing model requires
an enormous amount of labeled speech data, especially in the era of
data-centric artificial intelligence. However, labeled speech data are usually
scarcer and more expensive for collection, compared to textual data. We propose
Latent Synthesis (LaSyn), an efficient textual data utilization framework for
E2E speech processing models. We train a latent synthesizer to convert textual
data into an intermediate latent representation of a pre-trained speech model.
These pseudo acoustic representations of textual data augment acoustic data for
model training. We evaluate LaSyn on low-resource automatic speech recognition
(ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an
E2E baseline trained on LibriSpeech train-clean-100, with relative word error
rate reductions over 22.3% on different test sets. For SLU, LaSyn improves our
E2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for
slot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM)
and EM-Tree accuracies on STOP respectively. With fewer parameters, the results
of LaSyn are competitive to published state-of-the-art works. The results
demonstrate the quality of the augmented training data.Comment: 15 pages, 8 figures, 8 tables, Accepted to EMNLP 2023 Finding
- …