Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning
The extensive memory footprint of pre-trained language models (PLMs) can
hinder deployment in memory-constrained settings, such as cloud environments or
on-device. PLMs use embedding matrices to represent extensive vocabularies,
forming a large proportion of the model parameters. While previous work towards
parameter-efficient PLM development has considered pruning parameters within
the transformer layers, pruning the embedding matrix as part of fine-tuning or
inference has yet to be explored. We first demonstrate that a significant
proportion of the vocabulary remains unused in these scenarios. We then propose
a simple yet effective approach that leverages this finding to minimize the
memory footprint of the embedding matrix. We show that this approach provides
substantial reductions in memory usage across a wide range of models and tasks.
Notably, our approach maintains equivalent downstream task performance while
allowing a more efficient use of compute resources.
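A minimal sketch of the pruning idea described above, assuming a Hugging Face-style tokenizer and a PyTorch embedding matrix; the model name, corpus, and helper names are illustrative placeholders, not the paper's actual implementation:

```python
# Sketch: keep only the embedding rows for token ids that actually occur in the
# task corpus. Illustrative only; model name and corpus are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

corpus = ["an example task sentence", "another example"]  # fine-tuning/inference data

# 1. Collect the vocabulary ids actually used (plus special tokens).
used_ids = set(tokenizer.all_special_ids)
for text in corpus:
    used_ids.update(tokenizer(text)["input_ids"])
used_ids = sorted(used_ids)

# 2. Replace the full embedding matrix with only the used rows.
old_emb = model.get_input_embeddings()                       # nn.Embedding(V, d)
new_emb = torch.nn.Embedding(len(used_ids), old_emb.embedding_dim)
new_emb.weight.data = old_emb.weight.data[used_ids].clone()
model.set_input_embeddings(new_emb)

# 3. Remap original token ids to the compact id space before the forward pass.
id_map = {old: new for new, old in enumerate(used_ids)}
def remap(batch_ids):
    return torch.tensor([[id_map[i] for i in seq] for seq in batch_ids.tolist()])
```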
On the Limitations of Simulating Active Learning
Active learning (AL) is a human-and-model-in-the-loop paradigm that
iteratively selects informative unlabeled data for human annotation, aiming to
improve over random sampling. However, performing AL experiments with human
annotations on-the-fly is a laborious and expensive process, thus unrealistic
for academic research. An easy fix to this impediment is to simulate AL, by
treating an already labeled and publicly available dataset as the pool of
unlabeled data. In this position paper, we first survey recent literature and
highlight the challenges across all different steps within the AL loop. We
further unveil neglected caveats in the experimental setup that can
significantly affect the quality of AL research. We continue with an
exploration of how the simulation setting can govern empirical findings,
arguing that it might be one answer to the oft-posed question "why do active
learning algorithms sometimes fail to outperform random sampling?". We argue
that evaluating AL algorithms on available labeled datasets might provide only
a lower bound on their effectiveness on real data. We
believe it is essential to collectively shape the best practices for AL
research, particularly as engineering advancements in LLMs push the research
focus towards data-driven approaches (e.g., data efficiency, alignment,
fairness). In light of this, we have developed guidelines for future work. Our
aim is to draw attention to these limitations within the community, in the hope
of finding ways to address them.
Comment: To appear at Findings of ACL 202
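A rough sketch of the simulation setting the paper critiques, where an already labeled dataset stands in for the unlabeled pool and gold labels are revealed on demand instead of querying a human annotator; the uncertainty-sampling strategy, classifier, and budget are illustrative assumptions:

```python
# Simulated active learning loop: gold labels play the role of the annotator.
# Illustrative sketch only; strategy, model, and budget are arbitrary choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulate_al(X, y, seed_size=100, rounds=10, batch=50, seed=0):
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X), size=seed_size, replace=False))
    pool = [i for i in range(len(X)) if i not in set(labeled)]
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        # Uncertainty sampling: query the pool points the model is least confident on.
        confidence = clf.predict_proba(X[pool]).max(axis=1)
        queried = [pool[i] for i in np.argsort(confidence)[:batch]]
        labeled += queried            # "annotation" = reading the existing gold labels
        pool = [i for i in pool if i not in set(queried)]
    return clf
```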
Interpreting Document Collections with Topic Models
This thesis concerns topic models, a set of statistical methods for interpreting the contents of document collections. These models automatically learn sets of topics from words that frequently co-occur in documents. Learned topics often represent abstract thematic subjects, e.g. Sports or Politics, and each topic is associated with relevant documents.
These characteristics make topic models a useful tool for organising large digital libraries. Hence, these methods have been used to develop browsing systems that let users navigate document collections and identify relevant information through sets of topics linked to relevant documents.
This thesis addresses three related problems. First, we look at the problem of identifying incoherent topics and show that our methods outperform previously proposed approaches. Next, we propose novel methods for efficiently identifying semantically related topics, which can be used for topic recommendation. Finally, we look at alternative topic representations to topic keywords, proposing approaches that provide textual or image labels to assist topic interpretability. We also compare different topic representations within a document browsing system.
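As a minimal illustration of the kind of model the thesis studies, the sketch below fits a small LDA topic model with scikit-learn and prints topic keywords; the toy documents and library choice are assumptions, not the thesis's own systems:

```python
# Fit a toy topic model and inspect the keywords of each learned topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the match ended in a draw", "parliament passed the new bill",
        "the striker scored a late goal", "the senate debated the budget"]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top)}")  # e.g. a sports-like and a politics-like topic
```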
Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance?
Understanding how and what pre-trained language models (PLMs) learn about
language is an open challenge in natural language processing. Previous work has
focused on identifying whether they capture semantic and syntactic information,
and how the data or the pre-training objective affects their performance.
However, to the best of our knowledge, no previous work has specifically
examined how information loss in input token characters affects the performance
of PLMs. In this study, we address this gap by pre-training language models
using small subsets of characters from individual tokens. Surprisingly, we find
that even under extreme settings, i.e. pre-training with only one character of
each token, performance retention on standard NLU benchmarks and probing
tasks remains high relative to full-token models. For instance, a model pre-trained
only on single first characters from tokens achieves performance retention of
approximately \% and \% of the full-token model in SuperGLUE and GLUE
tasks, respectively.
Comment: To appear at EMNLP 202
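A minimal sketch of the kind of information-loss transform studied, keeping only the first character of each whitespace-separated token before the text reaches the model; the paper's exact pipeline may differ:

```python
# Keep only the first character of every token, discarding the rest.
def first_char_only(text: str) -> str:
    return " ".join(tok[0] for tok in text.split() if tok)

print(first_char_only("the quick brown fox jumps"))  # -> "t q b f j"
```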
A Multimodal Analysis of Influencer Content on Twitter
Influencer marketing involves a wide range of strategies in which brands
collaborate with popular content creators (i.e., influencers) to leverage their
reach, trust, and impact on their audience to promote and endorse products or
services. Because followers of influencers are more likely to buy a product
after receiving an authentic product endorsement than after an explicit direct
product promotion, the line between personal opinions and commercial content
promotion is frequently blurred. This makes automatic detection of regulatory
compliance breaches related to influencer advertising (e.g., misleading
advertising or hidden sponsorships) particularly difficult. In this work, we
(1) introduce a new Twitter (now X) dataset consisting of 15,998 influencer
posts mapped into commercial and non-commercial categories for assisting in the
automatic detection of commercial influencer content; (2) experiment with an
extensive set of predictive models that combine text and visual information
showing that our proposed cross-attention approach outperforms state-of-the-art
multimodal models; and (3) conduct a thorough analysis of strengths and
limitations of our models. We show that multimodal modeling is useful for
identifying commercial posts, reducing the amount of false positives, and
capturing relevant context that aids in the discovery of undisclosed commercial
posts.
Comment: Accepted at AACL 202
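A minimal sketch of a text-image cross-attention fusion layer in the spirit of the proposed approach; the dimensions, pooling, and binary classification head are illustrative assumptions rather than the paper's exact architecture:

```python
# Text tokens attend over visual features; a pooled representation is then classified.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 2)        # commercial vs non-commercial

    def forward(self, text_feats, image_feats):
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.classifier(fused.mean(dim=1))      # mean-pool tokens, then classify

text_feats = torch.randn(4, 32, 768)    # e.g. transformer token embeddings of a tweet
image_feats = torch.randn(4, 49, 768)   # e.g. ViT patch embeddings of the attached image
logits = CrossAttentionFusion()(text_feats, image_feats)
```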
Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization
Despite the remarkable performance of generative large language models (LLMs)
on abstractive summarization, they face two significant challenges: their
considerable size and tendency to hallucinate. Hallucinations are concerning
because they erode reliability and raise safety issues. Pruning is a technique
that reduces model size by removing redundant weights, enabling more efficient
sparse inference. Pruned models yield downstream task performance comparable to
that of the original models, making them ideal alternatives when operating on a limited
budget. However, the effect that pruning has upon hallucinations in abstractive
summarization with LLMs has yet to be explored. In this paper, we provide an
extensive empirical study across five summarization datasets, two
state-of-the-art pruning methods, and five instruction-tuned LLMs.
Surprisingly, we find that hallucinations from pruned LLMs are less prevalent
than those from the original models. Our analysis suggests that pruned models tend to
depend more on the source document for summary generation. This leads to a
higher lexical overlap between the generated summary and the source document,
which could be a reason for the reduction in hallucination risk.
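A minimal sketch of the kind of lexical-overlap measure behind this analysis, computing the fraction of summary n-grams that also appear in the source document; the exact metrics used in the paper may differ:

```python
# Fraction of summary n-grams that also occur in the source document.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def source_overlap(summary: str, source: str, n: int = 2) -> float:
    summ = ngrams(summary.lower().split(), n)
    src = ngrams(source.lower().split(), n)
    return len(summ & src) / max(len(summ), 1)

# Higher overlap means the summary copies more from its source, which the paper
# links to the lower hallucination rates observed for pruned models.
print(source_overlap("the cat sat on the mat", "a cat sat on the mat today"))
```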