Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition
The use of low-rank adaptation (LoRA) with frozen pretrained language models
(PLMs) has become increasingly popular as a mainstream, resource-efficient
modeling approach for memory-constrained hardware. In this study, we first
explore how to enhance model performance by introducing various LoRA training
strategies, achieving relative word error rate reductions of 3.50% on the
public LibriSpeech dataset and 3.67% on an internal dataset in the
messaging domain. To further characterize the stability of LoRA-based
second-pass speech recognition models, we examine robustness against input
perturbations. These perturbations are rooted in homophone replacements, and we
introduce a novel metric, N-best Perturbation-based Rescoring Robustness (NPRR),
designed to measure the relative degradation in the performance of rescoring
models. Our experimental results indicate that while advanced variants of LoRA,
such as dynamic rank-allocated LoRA, lead to performance degradation in 1-best
perturbation, they alleviate the degradation in N-best perturbation.
Compared to fully fine-tuned models and vanilla LoRA tuning baselines, this
finding suggests that careful method selection is needed when using LoRA-based
adaptation for compute-cost savings and robust language modeling.
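To make the adaptation mechanism concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer: the pretrained weight stays frozen and only a low-rank update is trained. The rank, scaling factor, and layer sizes below are illustrative assumptions, not the training strategies evaluated in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x)), with W frozen and A, B trainable."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.02)
        nn.init.zeros_(self.lora_b.weight)    # the update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Hypothetical usage: wrap one projection of a frozen PLM used for rescoring.
layer = LoRALinear(nn.Linear(768, 768), r=8)
print(layer(torch.randn(4, 768)).shape)       # torch.Size([4, 768])
```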
VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media
We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an
extension of the popular Vision-and-Language Transformer (ViLT), and improves
performance on vision-and-language (VL) tasks that involve more complex text
inputs than image captions while having minimal impact on training and
inference efficiency. ViLT, importantly, enables efficient training and
inference in VL tasks, achieved by encoding images using a linear projection of
patches instead of an object detector. However, it is pretrained on captioning
datasets, where the language input is simple, literal, and descriptive,
therefore lacking linguistic diversity. Consequently, when working with
multimedia data in the wild, such as multimodal social media data, there is a
notable shift away from captioning-style language, as well as greater diversity
of tasks. We indeed find
evidence that the language capacity of ViLT is lacking. The key insight of
VAuLT is to propagate the output representations of a large language model (LM)
like BERT to the language input of ViLT. We show that joint training of the LM
and ViLT in VAuLT can yield relative improvements up to 20% over ViLT on VL
tasks involving richer language inputs and affective constructs, such as for
Target-Oriented Sentiment Classification in TWITTER-2015 and TWITTER-2017, and
Sentiment Classification in MVSA-Single and MVSA-Multiple. Our code is
available at https://github.com/gchochla/VAuLT. Comment: 5 pages, 1 figure.
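The core mechanism above, propagating an LM's output representations into ViLT's language stream, can be sketched as follows. The projection module and the way its output would be handed to ViLT are hypothetical simplifications; the actual VAuLT wiring is in the linked repository.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class TextStreamAugmenter(nn.Module):
    """Sketch: encode text with BERT, then project the hidden states so they
    can stand in for ViLT's word embeddings as its language input."""

    def __init__(self, vilt_hidden_size: int = 768):
        super().__init__()
        self.lm = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.lm.config.hidden_size, vilt_hidden_size)

    def forward(self, input_ids, attention_mask):
        lm_out = self.lm(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(lm_out.last_hidden_state)   # (batch, seq_len, vilt_dim)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["great view but terrible service"], return_tensors="pt")
text_embeds = TextStreamAugmenter()(batch["input_ids"], batch["attention_mask"])
# These embeddings would replace ViLT's own token embeddings, with the LM and
# ViLT trained jointly as described in the abstract.
print(text_embeds.shape)
```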
Scaled-up Discovery of Latent Concepts in Deep NLP Models
Pre-trained language models (pLMs) learn intricate patterns and contextual
dependencies via unsupervised learning on vast text data, driving breakthroughs
across NLP tasks. Despite these achievements, these models remain black boxes,
necessitating research into understanding their decision-making processes.
Recent studies explore representation analysis by clustering latent spaces
within pre-trained models. However, these approaches are limited in terms of
scalability and the scope of interpretation because of high computation costs
of clustering algorithms. This study focuses on comparing clustering algorithms
for the purpose of scaling encoded concept discovery of representations from
pLMs. Specifically, we compare three algorithms in their capacity to unveil the
encoded concepts through their alignment to human-defined ontologies:
Agglomerative Hierarchical Clustering, Leaders Algorithm, and K-Means
Clustering. Our results show that K-Means has the potential to scale to very
large datasets, allowing rich latent concept discovery at both the word and
phrase level.
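As a minimal sketch of the setup being scaled, contextual token representations from a pretrained model can be clustered with mini-batch K-Means and each cluster inspected as a candidate latent concept. The model, layer, and cluster count below are illustrative assumptions rather than the study's configuration.

```python
import numpy as np
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()
sentences = ["The bank raised interest rates.", "She sat on the river bank."]

# Collect contextual token representations (here: the last layer).
tokens, vectors = [], []
with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, dim)
        ids = enc["input_ids"][0].tolist()
        for tok, vec in zip(tokenizer.convert_ids_to_tokens(ids), hidden):
            if tok not in ("[CLS]", "[SEP]"):
                tokens.append(tok)
                vectors.append(vec.numpy())

# Mini-batch K-Means scales to very large collections of token vectors, which
# is the scalability contrast drawn against agglomerative clustering.
kmeans = MiniBatchKMeans(n_clusters=5, random_state=0).fit(np.stack(vectors))
for tok, label in zip(tokens, kmeans.labels_):
    print(label, tok)   # tokens sharing a label form a candidate concept
```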
Few-Shot Spoken Language Understanding via Joint Speech-Text Models
Recent work on speech representation models jointly pre-trained with text has
demonstrated the potential of improving speech representations by encoding
speech and text in a shared space. In this paper, we leverage such shared
representations to address the persistent challenge of limited data
availability in spoken language understanding tasks. By employing a pre-trained
speech-text model, we find that models fine-tuned on text can be effectively
transferred to speech testing data. With as little as 1 hour of labeled speech
data, our proposed approach achieves comparable performance on spoken language
understanding tasks (specifically, sentiment analysis and named entity
recognition) when compared to previous methods using speech-only pre-trained
models fine-tuned on 10 times more data. Beyond the proof-of-concept study, we
also analyze the latent representations. We find that the bottom layers of
speech-text models are largely task-agnostic and align speech and text
representations into a shared space, while the top layers are more
task-specific.
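The transfer recipe, fine-tuning on text and then evaluating directly on speech through a shared encoder, can be sketched as below. The JointSpeechTextEncoder class is a hypothetical stand-in for a jointly pre-trained speech-text model; only the overall flow is illustrated, with feature sizes chosen arbitrarily.

```python
import torch
import torch.nn as nn

class JointSpeechTextEncoder(nn.Module):
    """Hypothetical stand-in for a speech-text model whose shared layers map
    both modalities into one space (dimension 512 assumed for illustration)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.text_branch = nn.Linear(300, dim)    # placeholder text front-end
        self.speech_branch = nn.Linear(80, dim)   # placeholder speech front-end
        self.shared = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def encode_text(self, x):    # x: (batch, words, 300) text features
        return self.shared(self.text_branch(x)).mean(dim=1)

    def encode_speech(self, x):  # x: (batch, frames, 80) filterbank features
        return self.shared(self.speech_branch(x)).mean(dim=1)

encoder = JointSpeechTextEncoder()
classifier = nn.Linear(512, 3)                    # e.g. 3 sentiment classes

# 1) Fine-tune the classifier on labeled text routed through the shared space.
text_feats, labels = torch.randn(16, 20, 300), torch.randint(0, 3, (16,))
loss = nn.functional.cross_entropy(
    classifier(encoder.encode_text(text_feats)), labels)
loss.backward()

# 2) At test time, feed speech through the same encoder and reuse the classifier.
speech_feats = torch.randn(4, 200, 80)
print(classifier(encoder.encode_speech(speech_feats)).argmax(dim=-1))
```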
On the Effectiveness of Compact Biomedical Transformers
Language models pre-trained on biomedical corpora, such as BioBERT, have
recently shown promising results on downstream biomedical tasks. Many existing
pre-trained models, on the other hand, are resource-intensive and
computationally heavy owing to factors such as embedding size, hidden
dimension, and number of layers. The natural language processing (NLP)
community has developed numerous strategies to compress these models utilising
techniques such as pruning, quantisation, and knowledge distillation, resulting
in models that are considerably faster, smaller, and subsequently easier to use
in practice. By the same token, in this paper we introduce six lightweight
models, namely, BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT,
TinyBioBERT, and CompactBioBERT which are obtained either by knowledge
distillation from a biomedical teacher or continual learning on the PubMed
dataset via the Masked Language Modelling (MLM) objective. We evaluate all of
our models on three biomedical tasks and compare them with BioBERT-v1.1 to
create efficient lightweight models that perform on par with their larger
counterparts. All the models will be publicly available on our Hugging Face
profile at https://huggingface.co/nlpie, and the code used to run the
experiments will be available at
https://github.com/nlpie-research/Compact-Biomedical-Transformers.
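For the distillation route, a minimal sketch of the usual MLM distillation objective is shown below: the student's masked-token predictions are matched to a biomedical teacher's soft targets and mixed with the standard hard-label loss. The temperature, mixing weight, and tensor shapes are illustrative assumptions, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL against the teacher plus the usual hard MLM loss.

    student_logits, teacher_logits: (batch, seq_len, vocab) MLM predictions.
    labels: (batch, seq_len), -100 on positions that were not masked.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft + (1.0 - alpha) * hard

# Shape check with a BERT-cased-sized vocabulary (28996) and one mask per row.
s = torch.randn(2, 16, 28996, requires_grad=True)
t = torch.randn(2, 16, 28996)
y = torch.full((2, 16), -100)
y[:, 3] = 1012
distillation_loss(s, t, y).backward()
```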
Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data
Multi-Task Learning (MTL) networks have emerged as a promising method for
transferring learned knowledge across different tasks. However, MTL must deal
with challenges such as: overfitting to low resource tasks, catastrophic
forgetting, and negative task transfer, or learning interference. Often, in
Natural Language Processing (NLP), a separate model per task is needed to
obtain the best performance. However, many fine-tuning approaches are both
parameter inefficient, i.e., potentially involving one new model per task, and
highly susceptible to losing knowledge acquired during pretraining. We propose
a novel Transformer architecture consisting of a new conditional attention
mechanism as well as a set of task-conditioned modules that facilitate weight
sharing. Through this construction, we achieve more efficient parameter sharing
and mitigate forgetting by keeping half of the weights of a pretrained model
fixed. We also use a new multi-task data sampling strategy to mitigate the
negative effects of data imbalance across tasks. Using this approach, we are
able to surpass single task fine-tuning methods while being parameter and data
efficient (using around 66% of the data for weight updates). Compared to other
BERT Large methods on GLUE, our 8-task model surpasses other Adapter methods by
2.8%, and our 24-task model outperforms models that use MTL and single-task
fine-tuning by 0.7-1.0%. We show that a larger variant of our single multi-task
model approach performs competitively across 26 NLP tasks and yields
state-of-the-art results on a number of test and development sets. Our code is
publicly available at https://github.com/CAMTL/CA-MTL. Comment: ICLR 2021.
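To illustrate the task-conditioning idea, the sketch below shares one attention block across tasks and lets a learned task embedding add a task-dependent bias to the attention logits. This additive bias is an illustrative simplification, not the paper's exact conditional attention formulation.

```python
import torch
import torch.nn as nn

class TaskConditionedAttention(nn.Module):
    """Shared multi-head attention plus a per-task bias on attention logits."""

    def __init__(self, dim=768, n_heads=12, n_tasks=8, max_len=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.task_emb = nn.Embedding(n_tasks, dim)
        # Hypothetical conditioning: map the task embedding to an additive
        # bias over key positions, broadcast across heads and query positions.
        self.to_bias = nn.Linear(dim, max_len)

    def forward(self, x, task_id):
        b, seq_len, _ = x.shape
        bias = self.to_bias(self.task_emb(task_id))[:, :seq_len]     # (b, L)
        mask = bias.unsqueeze(1).expand(b, seq_len, seq_len)         # (b, L, L)
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)    # (b*h, L, L)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

block = TaskConditionedAttention()
x = torch.randn(4, 16, 768)
task_id = torch.tensor([0, 1, 1, 3])       # one task label per batch element
print(block(x, task_id).shape)             # torch.Size([4, 16, 768])
```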