679 research outputs found
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
We summarize the accomplishments of a multi-disciplinary workshop exploring
the computational and scientific issues surrounding the discovery of linguistic
units (subwords and words) in a language without orthography. We study the
replacement of orthographic transcriptions by images and/or translated text in
a well-resourced language to help unsupervised discovery from raw speech.Comment: Accepted to ICASSP 201
WACO: Word-Aligned Contrastive Learning for Speech Translation
End-to-end Speech Translation (E2E ST) aims to directly translate source
speech into target text. Existing ST methods perform poorly when only extremely
small speech-text data are available for training. We observe that an ST
model's performance closely correlates with its embedding similarity between
speech and source transcript. In this paper, we propose Word-Aligned
COntrastive learning (WACO), a simple and effective method for extremely
low-resource speech-to-text translation. Our key idea is bridging word-level
representations for both speech and text modalities via contrastive learning.
We evaluate WACO and other methods on the MuST-C dataset, a widely used ST
benchmark, and on a low-resource direction Maltese-English from IWSLT 2023. Our
experiments demonstrate that WACO outperforms the best baseline by 9+ BLEU
points with only 1-hour parallel ST data. Code is available at
https://github.com/owaski/WACO.Comment: ACL 2023 Poste
Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis
Training a high performance end-to-end speech (E2E) processing model requires
an enormous amount of labeled speech data, especially in the era of
data-centric artificial intelligence. However, labeled speech data are usually
scarcer and more expensive for collection, compared to textual data. We propose
Latent Synthesis (LaSyn), an efficient textual data utilization framework for
E2E speech processing models. We train a latent synthesizer to convert textual
data into an intermediate latent representation of a pre-trained speech model.
These pseudo acoustic representations of textual data augment acoustic data for
model training. We evaluate LaSyn on low-resource automatic speech recognition
(ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an
E2E baseline trained on LibriSpeech train-clean-100, with relative word error
rate reductions over 22.3% on different test sets. For SLU, LaSyn improves our
E2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for
slot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM)
and EM-Tree accuracies on STOP respectively. With fewer parameters, the results
of LaSyn are competitive to published state-of-the-art works. The results
demonstrate the quality of the augmented training data.Comment: 15 pages, 8 figures, 8 tables, Accepted to EMNLP 2023 Finding
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Joint speech-language training is challenging due to the large demand for
training data and GPU consumption, as well as the modality gap between speech
and language. We present ComSL, a speech-language model built atop a composite
architecture of public pretrained speech-only and language-only models and
optimized data-efficiently for spoken language tasks. Particularly, we propose
to incorporate cross-modality learning into transfer learning and conduct them
simultaneously for downstream tasks in a multi-task learning manner. Our
approach has demonstrated effectiveness in end-to-end speech-to-text
translation tasks, achieving a new state-of-the-art average BLEU score of 31.5
on the multilingual speech to English text translation task for 21 languages,
as measured on the public CoVoST2 evaluation set
- …