TESSP: Text-Enhanced Self-Supervised Speech Pre-training
Self-supervised speech pre-training equips a model with the contextual structure inherent in the speech signal, while self-supervised text pre-training equips it with linguistic information. Both are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize speech and text representations in the same model. To solve this problem, we propose
Text-Enhanced Self-Supervised Speech Pre-training (TESSP), aiming to
incorporate linguistic information into speech pre-training. Our model
consists of three parts, i.e., a speech encoder, a text encoder and a shared
encoder. The model takes unlabeled speech and text data as input and applies the standard HuBERT and masked language modeling (MLM) losses, respectively. We also propose
phoneme up-sampling and representation swapping to enable joint modeling of the
speech and text information. Specifically, to address the length mismatch between speech and text data, we phonemize the text sequence and
up-sample the phonemes with the alignment information extracted from a small
set of supervised data. Moreover, to close the gap between the learned speech
and text representations, we swap the text representation with the speech
representation extracted by the respective private encoders according to the
alignment information. Experiments on the LibriSpeech dataset show that the proposed TESSP model achieves more than a 10% improvement over WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, where it outperforms WavLM on Phoneme Recognition, Automatic Speech Recognition and Speech Translation.
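As a concrete illustration of the phoneme up-sampling step, the sketch below repeats each phoneme embedding according to its aligned frame count so that the up-sampled text sequence matches the length of the speech frame sequence. The function name, shapes, and use of NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def upsample_phonemes(phoneme_embeddings: np.ndarray,
                      durations: np.ndarray) -> np.ndarray:
    """Repeat each phoneme embedding durations[i] times along the time axis.

    phoneme_embeddings: (num_phonemes, dim) phoneme representations.
    durations: (num_phonemes,) integer frame counts from forced alignment.
    Returns a (sum(durations), dim) frame-aligned text representation, which
    the representation-swapping step could then exchange with the aligned
    speech-encoder outputs (not shown here).
    """
    return np.repeat(phoneme_embeddings, durations, axis=0)

# Example: 3 phonemes aligned to 2, 4, and 3 speech frames respectively.
emb = np.random.randn(3, 8)
frames = upsample_phonemes(emb, np.array([2, 4, 3]))
assert frames.shape == (9, 8)
```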
CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training
Speech and text representations generated by pre-trained models contain modality-specific information that can be combined to benefit spoken language understanding (SLU) tasks. In this work, we propose a novel
pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training
(CIF-PT). It relies on a simple but effective frame-to-token alignment mechanism, continuous integrate-and-fire (CIF), to bridge the representations of speech and text, and it jointly performs speech-to-text training and language model distillation through CIF as the pre-training task. Evaluated on the SLURP SLU benchmark, CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy on intent classification and by 2.71% in SLU-F1 on slot filling. We also observe that the cross-modal representation extracted
by CIF-PT outperforms other neural interfaces on SLU tasks, including the dominant speech representation learned from self-supervised pre-training.
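For readers unfamiliar with CIF, the sketch below shows the core integrate-and-fire loop in plain NumPy: per-frame weights are accumulated until they cross a threshold, at which point a token-level vector is fired and the leftover weight is carried into the next token. This is a generic rendering of the CIF mechanism under the usual assumption that each weight stays below the threshold; names and shapes are illustrative, not the authors' code.

```python
import numpy as np

def cif(frames: np.ndarray, alphas: np.ndarray, threshold: float = 1.0):
    """Aggregate frame representations into token-level vectors.

    frames: (T, dim) encoder frame representations.
    alphas: (T,) non-negative firing weights, each assumed < threshold
            (e.g., outputs of a sigmoid weight head).
    Returns a list of (dim,) vectors, one per firing.
    """
    tokens = []
    acc_weight = 0.0                       # weight integrated so far
    acc_state = np.zeros(frames.shape[1])  # weighted sum of frames so far
    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # Keep integrating: no token boundary inside this frame.
            acc_weight += a
            acc_state = acc_state + a * h
        else:
            # Fire: spend just enough of this frame's weight to reach the
            # threshold, emit a token, and carry the remainder forward.
            spent = threshold - acc_weight
            tokens.append(acc_state + spent * h)
            acc_weight = a - spent
            acc_state = acc_weight * h
    return tokens

# Six frames with weight 0.4 each (total 2.4) fire exactly two tokens.
print(len(cif(np.random.randn(6, 4), np.full(6, 0.4))))  # -> 2
```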
Toponym extraction and disambiguation enhancement using loops of feedback
Toponym extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the semantic web. This paper addresses two problems with toponym extraction and disambiguation. First, almost no existing work examines the interdependency between extraction and disambiguation. Second, existing disambiguation techniques mostly take extracted named entities as input without considering the uncertainty and imperfection of the extraction process. In this paper we investigate both avenues and show that explicitly handling the uncertainty of annotation has much potential for making both extraction and disambiguation more robust. We conducted experiments on a set of holiday home descriptions with the aim of extracting and disambiguating toponyms. We show that extraction confidence probabilities are useful in enhancing the effectiveness of disambiguation; reciprocally, retraining the extraction models with information automatically derived from the disambiguation results improves the extraction models. This mutual reinforcement is shown to have an effect even after several automatic iterations.
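To make the forward direction of this feedback concrete, here is a minimal runnable sketch of confidence-aware disambiguation: extraction emits (toponym, probability) pairs instead of hard decisions, and each gazetteer candidate is scored by the extraction confidence of the context toponyms supporting it. The gazetteer, names, and probabilities are invented for illustration and are not the paper's models or data.

```python
# Toy gazetteer: each reading of an ambiguous toponym lists the context
# toponyms that would support it (entries invented for illustration).
GAZETTEER = {
    "Paris": [("Paris, France", {"France"}),
              ("Paris, Texas", {"Texas"})],
}

def disambiguate(mention, context_extractions):
    """Pick the reading best supported by confidently extracted context.

    context_extractions: list of (toponym, extraction probability) pairs,
    so uncertain extractions contribute less evidence than certain ones.
    """
    best, best_score = None, float("-inf")
    for reading, supporting in GAZETTEER[mention]:
        score = sum(p for toponym, p in context_extractions
                    if toponym in supporting)
        if score > best_score:
            best, best_score = reading, score
    return best, best_score

# "France" was extracted with high confidence, "Texas" with low confidence,
# so the French reading wins despite both readings having support.
print(disambiguate("Paris", [("France", 0.9), ("Texas", 0.2)]))
# -> ('Paris, France', 0.9)
```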
Improving named entity disambiguation by iteratively enhancing certainty of extraction
Named entity extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the semantic web. This paper addresses two problems with named entity extraction and disambiguation. First, almost no existing work examines the interdependency between extraction and disambiguation. Second, existing disambiguation techniques mostly take extracted named entities as input without considering the uncertainty and imperfection of the extraction process. The aim of this paper is to investigate both avenues and to show that explicitly handling the uncertainty of annotation has much potential for making both extraction and disambiguation more robust. We conducted experiments on a set of holiday home descriptions with the aim of extracting and disambiguating toponyms as a representative example of named entities. We show that the effectiveness of extraction influences the effectiveness of disambiguation and, reciprocally, that retraining the extraction models with information automatically derived from the disambiguation results improves the extraction models. This mutual reinforcement is shown to have an effect even after several iterations.
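The reverse direction, feeding disambiguation results back into extraction, can be sketched as the loop below. The extractor here is a deliberately trivial toy and disambiguation is simulated by a gazetteer membership check; only the control flow of the mutual-reinforcement iterations mirrors the approach described above, and all names and data are invented.

```python
# Invented place names standing in for a real gazetteer.
GAZETTEER = {"Bled", "Bohinj"}

def train_extractor(known_toponyms):
    """Toy extractor: capitalized tokens are candidates; trained names get
    high confidence, unseen ones low confidence."""
    known = set(known_toponyms)
    return lambda tokens: [(t, 0.9 if t in known else 0.4)
                           for t in tokens if t[0].isupper()]

def mutual_reinforcement(corpus, seed, rounds=2):
    known = set(seed)
    for _ in range(rounds):
        model = train_extractor(known)
        for tokens in corpus:
            for term, _prob in model(tokens):
                # A low-confidence extraction that disambiguation confirms
                # becomes training data for the next round.
                if term in GAZETTEER:
                    known.add(term)
    return train_extractor(known)

corpus = [["Chalet", "near", "Bled"], ["Cabin", "by", "lake", "Bohinj"]]
model = mutual_reinforcement(corpus, seed={"Bled"})
print(model(["Visit", "Bohinj"]))  # "Bohinj" now extracted with 0.9
```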