1,170 research outputs found
Direct Speech-to-Text Translation Models as Students of Text-to-Text Models
Direct speech-to-text translation (ST) is an emerging approach that performs the ST task with a single neural model. Although this paradigm promises to outperform traditional pipeline systems, its rise is still limited by the paucity of paired speech-translation corpora compared to the large amounts of speech-transcript and parallel bilingual corpora available to train previous solutions. As such, the research community has focused on techniques to transfer knowledge from automatic speech recognition (ASR) and machine translation (MT) models trained on huge datasets. In this paper, we extend and integrate our recent work (Gaido, Gangi, et al. 2020) analysing the best-performing approach to transfer learning from MT, namely knowledge distillation (KD) in sequence-to-sequence models. After comparing the different KD methods to understand which is the most effective, we extend our previous analysis of the effects (both benefits and drawbacks) to different language pairs in high-resource conditions, ensuring the generalisability of our findings. Altogether, these extensions complement and complete our investigation of KD for speech translation, leading to the following overall findings: i) the best training recipe involves word-level KD training followed by a fine-tuning step on the ST task; ii) word-level KD from MT can be detrimental for gender translation and can lead to output truncation (though both problems are alleviated by the fine-tuning on the ST task); and iii) the quality of the ST student model strongly depends on the quality of the MT teacher model, although the correlation is not linear.
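As a concrete illustration of finding (i), the word-level KD objective can be sketched as a per-token cross-entropy against the MT teacher's output distribution. This is a minimal sketch over a toy two-word vocabulary with illustrative names, not the paper's implementation:

```python
import math

def word_level_kd_loss(teacher_probs, student_logprobs):
    """Word-level KD loss: cross-entropy between the MT teacher's
    per-token output distribution and the ST student's predicted
    log-probabilities, averaged over target positions."""
    total = 0.0
    for p_t, logq_s in zip(teacher_probs, student_logprobs):
        # Soft targets: weight each vocabulary entry's log-probability
        # by the teacher's probability mass for that entry.
        total -= sum(p * logq for p, logq in zip(p_t, logq_s))
    return total / len(teacher_probs)
```

With one-hot teacher distributions this reduces to the ordinary cross-entropy on hard labels, which is why word-level KD can be seen as training on the teacher's soft targets instead.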
Contextualization Distillation from Large Language Model for Knowledge Graph Completion
While textual information significantly enhances the performance of
pre-trained language models (PLMs) in knowledge graph completion (KGC), the
static and noisy nature of existing corpora collected from Wikipedia articles
or synsets definitions often limits the potential of PLM-based KGC models. To
surmount these challenges, we introduce the Contextualization Distillation
strategy, a versatile plug-and-play approach compatible with both
discriminative and generative KGC frameworks. Our method begins by instructing
large language models (LLMs) to transform compact, structural triplets into
context-rich segments. Subsequently, we introduce two tailored auxiliary tasks,
reconstruction and contextualization, allowing smaller KGC models to assimilate
insights from these enriched triplets. Comprehensive evaluations across diverse
datasets and KGC techniques highlight the efficacy and adaptability of our
approach, revealing consistent performance enhancements irrespective of
underlying pipelines or architectures. Moreover, our analysis makes our method
more explainable and provides insight into generation-path selection, as well
as the choice of suitable distillation tasks. All the code and data in this
work will be released at
https://github.com/David-Li0406/Contextulization-Distillation
Comment: Accepted by EACL 2024 findings; v3: add missing citation
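The two stages described above (prompting an LLM to verbalize a compact triplet, then deriving an auxiliary reconstruction target from the generated context) can be sketched as follows; the prompt template and function names are hypothetical:

```python
def triplet_to_prompt(head, relation, tail):
    # Illustrative prompt template asking an LLM to transform a compact,
    # structural KG triplet into a context-rich passage.
    return (f"Describe the fact ({head}, {relation}, {tail}) "
            f"in one short, context-rich paragraph.")

def reconstruction_pair(context, head, tail, mask="[MASK]"):
    # Auxiliary reconstruction task: mask both entities in the
    # LLM-generated context; the smaller KGC model learns to recover
    # the original passage, assimilating the enriched description.
    corrupted = context.replace(head, mask).replace(tail, mask)
    return corrupted, context
```

The contextualization task would analogously train the smaller model to produce the enriched passage directly from the bare triplet.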
Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation
Large-scale pre-trained language models (PLMs) have shown great potential in
natural language processing tasks. Leveraging the capabilities of PLMs to
enhance automatic speech recognition (ASR) systems has also emerged as a
promising research direction. However, previous works may be limited by the
inflexible structures of PLMs and their insufficient utilization. To
alleviate these problems, we propose the hierarchical knowledge distillation
(HKD) on the continuous integrate-and-fire (CIF) based ASR models. To transfer
knowledge from PLMs to the ASR models, HKD employs cross-modal knowledge
distillation with contrastive loss at the acoustic level and knowledge
distillation with regression loss at the linguistic level. Compared with the
original CIF-based model, our method achieves 15% and 9% relative error rate
reduction on the AISHELL-1 and LibriSpeech datasets, respectively.
Comment: Accepted by INTERSPEECH 202
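The two levels of HKD can be illustrated with toy versions of the losses: an InfoNCE-style contrastive loss at the acoustic level and MSE regression at the linguistic level. A minimal sketch assuming one embedding per token; names are illustrative, not the paper's code:

```python
import math

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def acoustic_contrastive_loss(cif_embs, plm_embs, tau=0.1):
    # Acoustic level: InfoNCE-style contrastive loss pulling each CIF
    # acoustic embedding toward the PLM embedding of its own token and
    # away from the other tokens in the batch.
    loss = 0.0
    for i, a in enumerate(cif_embs):
        sims = [_cosine(a, t) / tau for t in plm_embs]
        m = max(sims)  # log-sum-exp with max-shift for stability
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        loss -= sims[i] - log_z
    return loss / len(cif_embs)

def linguistic_regression_loss(student_states, plm_states):
    # Linguistic level: mean-squared-error regression of the ASR model's
    # linguistic representations onto the PLM's hidden states.
    n, total = 0, 0.0
    for s_vec, t_vec in zip(student_states, plm_states):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            n += 1
    return total / n
```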
Unsupervised Fact Verification by Language Model Distillation
Unsupervised fact verification aims to verify a claim using evidence from a
trustworthy knowledge base without any kind of data annotation. To address this
challenge, algorithms must produce features for every claim that are both
semantically meaningful, and compact enough to find a semantic alignment with
the source information. In contrast to previous work, which tackled the
alignment problem by learning over annotated corpora of claims and their
corresponding labels, we propose SFAVEL (Self-supervised Fact Verification via
Language Model Distillation), a novel unsupervised framework that leverages
pre-trained language models to distil self-supervised features into
high-quality claim-fact alignments without the need for annotations. This is
enabled by a novel contrastive loss function that encourages features to attain
high-quality claim and evidence alignments whilst preserving the semantic
relationships across the corpora. Notably, we present results that achieve a
new state-of-the-art on the standard FEVER fact verification benchmark (+8%
accuracy) with linear evaluation.
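The alignment step at the core of such a framework can be illustrated as nearest-neighbour matching between claim and fact features. A toy sketch assuming fixed embeddings (the actual framework learns these features via its contrastive objective, with no labels at any point):

```python
import math

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align_claim(claim_vec, fact_vecs):
    # Unsupervised alignment: score the claim feature against every fact
    # feature in the knowledge base and return the best match together
    # with its similarity.
    scores = [_cosine(claim_vec, f) for f in fact_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```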
Reducing Model Jitter: Stable Re-training of Semantic Parsers in Production Environments
Retraining modern deep learning systems can lead to variations in model
performance even when they are trained on the same data with the same
hyper-parameters, simply because different random seeds are used. We call
this phenomenon model jitter. This
issue is often exacerbated in production settings, where models are retrained
on noisy data. In this work we tackle the problem of stable retraining with a
focus on conversational semantic parsers. We first quantify the model jitter
problem by introducing the model agreement metric and showing the variation
with dataset noise and model sizes. We then demonstrate the effectiveness of
various jitter reduction techniques such as ensembling and distillation.
Lastly, we discuss practical trade-offs between such techniques and show that
co-distillation provides a sweet spot in terms of jitter reduction for semantic
parsing systems with only a modest increase in resource usage.
Comment: SIGDIAL 2022 Best Paper
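The model agreement metric, and ensembling as one jitter-reduction baseline, can be sketched as follows; function names are illustrative:

```python
from collections import Counter

def model_agreement(preds_a, preds_b):
    # Model agreement: fraction of inputs on which two retrained models
    # (same data and hyper-parameters, different seeds) emit the same parse.
    same = sum(1 for a, b in zip(preds_a, preds_b) if a == b)
    return same / len(preds_a)

def ensemble_vote(per_model_preds):
    # Jitter-reduction baseline: majority vote over an ensemble of
    # retrained models; per_model_preds holds one prediction list per model.
    voted = []
    for preds in zip(*per_model_preds):
        voted.append(Counter(preds).most_common(1)[0][0])
    return voted
```

Agreement of 1.0 means the two retrains are behaviourally identical; the ensemble's vote is more stable across seeds than any single member, at the cost of serving several models.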
GripRank: Bridging the Gap between Retrieval and Generation via the Generative Knowledge Improved Passage Ranking
Retrieval-enhanced text generation, which aims to leverage passages retrieved
from a large passage corpus for delivering a proper answer given the input
query, has shown remarkable progress on knowledge-intensive language tasks such
as open-domain question answering and knowledge-enhanced dialogue generation.
However, the retrieved passages are not ideal for guiding answer generation
because of the discrepancy between retrieval and generation, i.e., the
candidate passages are all treated equally during the retrieval procedure
without considering their potential to generate the proper answers. This
discrepancy makes a passage retriever deliver a sub-optimal collection of
candidate passages to generate answers. In this paper, we propose the
GeneRative Knowledge Improved Passage Ranking (GripRank) approach, addressing
the above challenge by distilling knowledge from a generative passage estimator
(GPE) to a passage ranker, where the GPE is a generative language model used to
measure how likely the candidate passages can generate the proper answer. We
realize the distillation procedure by training the passage ranker to rank the
passages in the order given by the GPE. Furthermore, we improve the distillation
quality by devising a curriculum knowledge distillation mechanism, which allows
the knowledge provided by the GPE to be progressively distilled to the ranker
through an easy-to-hard curriculum, enabling the passage ranker to correctly
recognize the provenance of the answer from many plausible candidates. We
conduct extensive experiments on four datasets across three knowledge-intensive
language tasks. Experimental results show advantages over the state-of-the-art
methods for both passage ranking and answer generation on the KILT benchmark.
Comment: 11 pages, 4 figures
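The distillation objective can be illustrated as a KL divergence pushing the ranker's distribution over candidate passages toward the GPE's answer-likelihood distribution. A minimal sketch, with the curriculum reduced to a temperature knob (this annealing view is an assumption, not the paper's exact mechanism):

```python
import math

def _softmax(xs, tau):
    # Temperature-scaled softmax with max-shift for numerical stability.
    m = max(xs)
    es = [math.exp((x - m) / tau) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def ranking_distill_loss(ranker_scores, gpe_scores, tau=1.0):
    # KL(GPE || ranker) over candidate passages: the ranker learns to
    # reproduce the GPE's preference ordering. A curriculum could anneal
    # tau from high (soft, easy targets) to low (sharp, hard targets).
    p = _softmax(gpe_scores, tau)
    q = _softmax(ranker_scores, tau)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```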