22 research outputs found
Structured Pruning Learns Compact and Accurate Models
The growing size of neural language models has led to increased attention in
model compression. The two predominant approaches are pruning, which gradually
removes weights from a pre-trained model, and distillation, which trains a
smaller compact model to match a larger one. Pruning methods can significantly
reduce the model size but rarely achieve speedups as large as distillation.
However, distillation methods require large amounts of unlabeled data and are
expensive to train. In this work, we propose a task-specific structured pruning
method CoFi (Coarse- and Fine-grained Pruning), which delivers highly
parallelizable subnetworks and matches the distillation methods in both
accuracy and latency, without resorting to any unlabeled data. Our key insight
is to jointly prune coarse-grained (e.g., layers) and fine-grained (e.g., heads
and hidden units) modules, which controls the pruning decision of each
parameter with masks of different granularity. We also devise a layerwise
distillation strategy to transfer knowledge from unpruned to pruned models
during optimization. Our experiments on GLUE and SQuAD datasets show that CoFi
yields models with over 10x speedups with a small accuracy drop, showing its
effectiveness and efficiency compared to previous pruning and distillation
approaches.
Comment: Accepted to ACL 2022; The code and models are available at
https://github.com/princeton-nlp/CoFiPrunin
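As an illustration of the joint coarse- and fine-grained masking idea described above, here is a minimal PyTorch sketch. It is not the CoFi implementation (which samples masks from a hard-concrete distribution, trains them with an L0 penalty, and adds layerwise distillation); the sigmoid mask parameterization and module sizes here are illustrative assumptions. The point it shows is that the coarse layer mask multiplies every fine-grained head mask, so pruning the coarse unit removes all finer units inside it.

```python
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Toy self-attention whose heads can be pruned by masks of two
    granularities: a per-head mask and a coarser per-layer mask.
    The effective keep-decision of a head is the product of both,
    mirroring the idea (not the exact mechanics) of CoFi."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learnable mask logits; in CoFi these would be hard-concrete samples
        # pushed toward exact 0/1 values by a sparsity (L0) objective.
        self.head_logits = nn.Parameter(torch.zeros(n_heads))
        self.layer_logit = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        heads = att @ v                                   # (B, H, T, d_head)
        # Coarse-grained (layer) and fine-grained (head) masks combine
        # multiplicatively, so pruning the layer prunes every head in it.
        z = torch.sigmoid(self.head_logits) * torch.sigmoid(self.layer_logit)
        heads = heads * z.view(1, -1, 1, 1)
        return self.out(heads.transpose(1, 2).reshape(B, T, D))

x = torch.randn(2, 8, 64)
print(MaskedSelfAttention()(x).shape)   # torch.Size([2, 8, 64])
```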
Trainable Transformer in Transformer
Recent works attribute the capability of in-context learning (ICL) in large
pre-trained language models to implicitly simulating and fine-tuning an
internal model (e.g., linear or 2-layer MLP) during inference. However, such
constructions require large memory overhead, which makes simulation of more
sophisticated internal models intractable. In this work, we propose an
efficient construction, Transformer in Transformer (in short, TinT), that
allows a transformer to simulate and fine-tune complex models internally during
inference (e.g., pre-trained language models). In particular, we introduce
innovative approximation techniques that allow a TinT model with less than 2
billion parameters to simulate and fine-tune a 125 million parameter
transformer model within a single forward pass. TinT accommodates many common
transformer variants and its design ideas also improve the efficiency of past
instantiations of simple models inside transformers. We conduct end-to-end
experiments to validate the internal fine-tuning procedure of TinT on various
language modeling and downstream tasks. For example, even with a limited
one-step budget, we observe that TinT for an OPT-125M model improves performance by
4-16% absolute on average compared to OPT-125M. These findings suggest that
large pre-trained language models are capable of performing intricate
subroutines. To facilitate further work, a modular and extensible codebase for
TinT is included.
Comment: Code base:
https://github.com/abhishekpanigrahi1996/transformer_in_transforme
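TinT's construction itself is intricate, but the computation it is built to approximate, fine-tuning an internal model on the context before answering, can be written out explicitly. The toy sketch below performs that explicit one-step update with a hypothetical stand-in model (a tiny embedding-plus-linear "LM", not OPT-125M and not the TinT simulator), roughly the dynamic-evaluation style baseline that TinT folds into a single forward pass.

```python
import torch
import torch.nn.functional as F

# Toy stand-in "language model": embedding + linear next-token head.
# (Hypothetical; in the paper the internal model is e.g. OPT-125M.)
vocab_size, dim = 100, 32
torch.manual_seed(0)
emb = torch.nn.Embedding(vocab_size, dim)
head = torch.nn.Linear(dim, vocab_size)

def nll(tokens):
    """Next-token negative log-likelihood for a 1-D tensor of token ids."""
    logits = head(emb(tokens[:-1]))
    return F.cross_entropy(logits, tokens[1:])

context = torch.randint(0, vocab_size, (64,))   # in-context tokens
query = torch.randint(0, vocab_size, (16,))     # tokens evaluated afterwards

# Explicit one-step fine-tuning on the context, then evaluation on the query:
# the computation that TinT approximates inside a single forward pass of the
# larger simulator model (a "limited one-step budget").
opt = torch.optim.SGD(list(emb.parameters()) + list(head.parameters()), lr=1e-2)
print("query NLL before update:", nll(query).item())
opt.zero_grad()
nll(context).backward()    # gradient of the context loss w.r.t. the LM weights
opt.step()                 # single internal update step
print("query NLL after update: ", nll(query).item())
```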
MABEL: Attenuating Gender Bias using Textual Entailment Data
Pre-trained language models encode undesirable social biases, which are
further exacerbated in downstream use. To this end, we propose MABEL (a Method
for Attenuating Gender Bias using Entailment Labels), an intermediate
pre-training approach for mitigating gender bias in contextualized
representations. Key to our approach is the use of a contrastive learning
objective on counterfactually augmented, gender-balanced entailment pairs from
natural language inference (NLI) datasets. We also introduce an alignment
regularizer that pulls identical entailment pairs along opposite gender
directions closer together. We extensively evaluate our approach on intrinsic and
extrinsic metrics, and show that MABEL outperforms previous task-agnostic
debiasing approaches in terms of fairness. It also preserves task performance
after fine-tuning on downstream tasks. Together, these findings demonstrate the
suitability of NLI data as an effective means of bias mitigation, as opposed to
relying solely on unlabeled sentences, as in prior work. Finally, we identify that
existing approaches often use evaluation settings that are insufficient or
inconsistent. We make an effort to reproduce and compare previous methods, and
call for unifying the evaluation settings across gender debiasing methods for
better future comparison.
Comment: Accepted to EMNLP 2022. Code and models are publicly available at
https://github.com/princeton-nlp/mabe
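A rough sketch of the two ingredients in the abstract above: an in-batch contrastive loss over entailment pairs plus an alignment term between each pair and its gender-counterfactual version. The random embeddings, the InfoNCE temperature, and the 0.1 weight are illustrative assumptions, and both losses are generic stand-ins rather than MABEL's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.05):
    """In-batch contrastive loss: a[i] should match b[i] against all other b[j].
    A generic InfoNCE sketch, not MABEL's exact objective."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    return F.cross_entropy(logits, torch.arange(a.size(0)))

# Hypothetical sentence embeddings from an encoder (batch of 8, dim 128).
# prem/hyp are entailment pairs from NLI; prem_cf/hyp_cf are their
# gender-counterfactual (word-swapped) counterparts.
B, D = 8, 128
prem, hyp = torch.randn(B, D), torch.randn(B, D)
prem_cf, hyp_cf = torch.randn(B, D), torch.randn(B, D)

# Contrastive term over the gender-balanced entailment pairs.
contrastive = info_nce(prem, hyp) + info_nce(prem_cf, hyp_cf)
# Alignment term pulling each pair and its counterfactual (the opposite
# gender direction) toward the same representation.
alignment = F.mse_loss(prem, prem_cf) + F.mse_loss(hyp, hyp_cf)

loss = contrastive + 0.1 * alignment   # 0.1 is an arbitrary illustrative weight
print(loss.item())
```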
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged
moderate-sized large language models (LLMs) highlights the potential of
building smaller yet powerful LLMs. Regardless, the cost of training such
models from scratch on trillions of tokens remains high. In this work, we study
structured pruning as an effective means to develop smaller LLMs from
pre-trained, larger models. Our approach employs two key techniques: (1)
targeted structured pruning, which prunes a larger model to a specified target
shape by removing layers, heads, and intermediate and hidden dimensions in an
end-to-end manner, and (2) dynamic batch loading, which dynamically updates the
composition of sampled data in each training batch based on varying losses
across different domains. We demonstrate the efficacy of our approach by
presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B
and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art
open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA
models, on a wide range of downstream and instruction tuning evaluations, while
requiring only 3% of compute compared to training such models from scratch.
This work provides compelling evidence that leveraging existing LLMs with
structured pruning is a far more cost-effective approach for building smaller
LLMs.
Comment: The code and models are available at
https://github.com/princeton-nlp/LLM-Shearin
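The dynamic batch loading idea can be sketched as a simple re-weighting rule: domains whose current loss sits furthest above a per-domain reference loss get sampled more in the next batch. The exponentiated update, the domain weights, and the loss values below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def update_proportions(ref_weights, current_loss, reference_loss, lr=1.0):
    """Simplified dynamic-batch-loading step (a sketch of the idea, not the
    paper's exact rule): domains whose current loss exceeds their reference
    loss the most are up-weighted for the next batch."""
    gap = np.maximum(current_loss - reference_loss, 0.0)
    scores = ref_weights * np.exp(lr * gap)     # exponentiated up-weighting
    return scores / scores.sum()

domains = ["CommonCrawl", "C4", "GitHub", "Books", "Wikipedia", "ArXiv", "StackExchange"]
ref_weights = np.array([0.67, 0.15, 0.045, 0.045, 0.045, 0.025, 0.02])   # illustrative mixture
reference_loss = np.array([1.90, 1.95, 1.10, 2.05, 1.75, 1.55, 1.65])    # hypothetical targets
current_loss   = np.array([2.10, 2.00, 1.12, 2.40, 1.80, 1.56, 1.70])    # hypothetical measurements

new_weights = update_proportions(ref_weights, current_loss, reference_loss)
for d, w in zip(domains, new_weights):
    print(f"{d:>13s}: {w:.3f}")
# A training loop would recompute these proportions every few hundred steps
# and sample each batch's domain mixture from the updated weights.
```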
Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models
Pre-trained masked language models successfully perform few-shot learning by
formulating downstream tasks as text infilling. However, as a strong
alternative in full-shot settings, discriminative pre-trained models like
ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based
few-shot learning to ELECTRA and show that it outperforms masked language
models in a wide range of tasks. ELECTRA is pre-trained to distinguish if a
token is generated or original. We naturally extend that to prompt-based
few-shot learning by training to score the originality of the target options
without introducing new parameters. Our method can be easily adapted to tasks
involving multi-token predictions without extra computation overhead. Analysis
shows that ELECTRA learns distributions that align better with downstream
tasks.
Comment: Accepted to EMNLP 2022; The code is available at
https://github.com/facebookresearch/ELECTRA-Fewshot-Learnin
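A small sketch of the scoring idea using the Hugging Face ElectraForPreTraining discriminator: each verbalizer option is placed into a prompt template and scored by how "original" its tokens look, i.e. lower replaced-token logits are better, with multi-token options handled by averaging. The checkpoint name, template, and verbalizers are illustrative choices, not the paper's exact prompts.

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

# Score each verbalizer with ELECTRA's replaced-token-detection head; the
# option whose tokens look most "original" wins.
name = "google/electra-small-discriminator"
tok = AutoTokenizer.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name).eval()

def option_score(sentence, option):
    prefix = f"{sentence} It was "
    enc = tok(prefix + option + ".", return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]        # per-token "replaced" logits
    n_prefix = len(tok(prefix, add_special_tokens=False)["input_ids"])
    n_option = len(tok(option, add_special_tokens=False)["input_ids"])
    start = 1 + n_prefix                       # offset by the leading [CLS]
    # Lower mean logit means the tokens look more original, so negate it to
    # obtain a "pick the highest" score; averaging handles multi-token options.
    return -logits[start:start + n_option].mean().item()

sentence = "The movie was a breathtaking ride from start to finish."
scores = {opt: option_score(sentence, opt) for opt in ["great", "terrible"]}
print(max(scores, key=scores.get), scores)     # expected label: "great"
```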
Detecting Pretraining Data from Large Language Models
Although large language models (LLMs) are widely deployed, the data used to
train them is rarely disclosed. Given the incredible scale of this data, up to
trillions of tokens, it is all but certain that it includes potentially
problematic text such as copyrighted materials, personally identifiable
information, and test data for widely reported reference benchmarks. However,
we currently have no way to know which data of these types is included or in
what proportions. In this paper, we study the pretraining data detection
problem: given a piece of text and black-box access to an LLM without knowing
the pretraining data, can we determine if the model was trained on the provided
text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that
uses data created before and after model training to support ground-truth
detection. We also introduce a new detection method Min-K% Prob based on a
simple hypothesis: an unseen example is likely to contain a few outlier words
with low probabilities under the LLM, while a seen example is less likely to
have words with such low probabilities. Min-K% Prob can be applied without any
knowledge about the pretraining corpus or any additional training, departing
from previous detection methods that require training a reference model on data
that is similar to the pretraining data. Moreover, our experiments demonstrate
that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous
methods. We apply Min-K% Prob to three real-world scenarios: copyrighted book
detection, contaminated downstream example detection, and privacy auditing of
machine unlearning, and find it a consistently effective solution.
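A minimal sketch of the Min-K% Prob score described above: compute per-token log-probabilities under the model and average the lowest k% of them; text the model was trained on tends to score higher (less negative). GPT-2 stands in here for the target LLM, and the thresholding step noted in the comments is an assumption rather than a value from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Min-K% Prob sketch: average the log-probabilities of the k% least likely
# tokens in the text under the model. GPT-2 stands in for the target LLM.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def min_k_prob(text, k=0.2):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)      # next-token distributions
    token_lp = logprobs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    n = max(1, int(k * token_lp.numel()))
    return token_lp.sort().values[:n].mean().item()           # mean of the lowest k%

score = min_k_prob("The quick brown fox jumps over the lazy dog.")
print(score)   # higher (less negative) suggests the text was more likely seen
# A detection decision then thresholds this score (flag as pretraining data if
# the score exceeds a cutoff tuned on a labeled split such as WikiMIA); any
# specific cutoff here would be an assumption, not a value from the paper.
```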
Training Trajectories of Language Models Across Scales
Scaling up language models has led to unprecedented performance gains, but
little is understood about how the training dynamics change as models get
larger. How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors? In this
paper, we analyze the intermediate training checkpoints of differently sized
OPT models (Zhang et al., 2022), from 125M to 175B parameters, on next-token
prediction, sequence-level generation, and downstream tasks. We find that 1) at
a given perplexity and independent of model sizes, a similar subset of training
tokens sees the most significant reduction in loss, with the rest stagnating or
showing double-descent behavior; 2) early in training, all models learn to
reduce the perplexity of grammatical sequences that contain hallucinations,
with small models halting at this suboptimal distribution and larger ones
eventually learning to assign these sequences lower probabilities; 3)
perplexity is a strong predictor of in-context learning performance on 74
multiple-choice tasks from BIG-Bench, and this holds independent of the model
size. Together, these results show that perplexity is more predictive of model
behaviors than model size or training computation.
Comment: Accepted to ACL 2023; The code and analysis results are available at
https://github.com/xiamengzhou/training_trajectory_analysi
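The headline analysis, perplexity as a size-independent predictor of downstream behavior, can be sketched by pooling checkpoints of different sizes and checking whether accuracy tracks perplexity alone. The numbers below are made up purely for illustration; the paper's analysis uses the released OPT checkpoints and 74 BIG-Bench multiple-choice tasks.

```python
import numpy as np

# Sketch of the style of analysis, with made-up numbers: pool checkpoints of
# different model sizes and ask whether downstream accuracy tracks perplexity
# alone, independent of parameter count. (size label, perplexity, accuracy)
checkpoints = [
    ("125M", 28.0, 0.31), ("125M", 22.0, 0.33), ("125M", 19.5, 0.35),
    ("1.3B", 17.0, 0.38), ("1.3B", 14.5, 0.41), ("1.3B", 13.0, 0.43),
    ("13B",  12.0, 0.45), ("13B",  10.5, 0.48), ("13B",   9.8, 0.50),
]
ppl = np.array([c[1] for c in checkpoints])
acc = np.array([c[2] for c in checkpoints])

# If accuracy is a function of perplexity alone, checkpoints of every size
# fall on one curve; a strong negative correlation with log-perplexity is
# the simplest check of that claim.
r = np.corrcoef(np.log(ppl), acc)[0, 1]
print(f"correlation(log perplexity, accuracy) = {r:.3f}")
```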