Structured Pruning Learns Compact and Accurate Models
The growing size of neural language models has led to increased attention in
model compression. The two predominant approaches are pruning, which gradually
removes weights from a pre-trained model, and distillation, which trains a
smaller compact model to match a larger one. Pruning methods can significantly
reduce the model size but rarely achieve speedups as large as distillation.
Distillation methods, however, require large amounts of unlabeled data and are
expensive to train. In this work, we propose a task-specific structured pruning
method CoFi (Coarse- and Fine-grained Pruning), which delivers highly
parallelizable subnetworks and matches the distillation methods in both
accuracy and latency, without resorting to any unlabeled data. Our key insight
is to jointly prune coarse-grained (e.g., layers) and fine-grained (e.g., heads
and hidden units) modules, which controls the pruning decision of each
parameter with masks of different granularity. We also devise a layerwise
distillation strategy to transfer knowledge from unpruned to pruned models
during optimization. Our experiments on GLUE and SQuAD datasets show that CoFi
yields models with over 10x speedups and only a small accuracy drop, showing its
effectiveness and efficiency compared to previous pruning and distillation
approaches.
Comment: Accepted to ACL 2022; The code and models are available at
https://github.com/princeton-nlp/CoFiPrunin
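The key insight above, controlling each parameter's pruning decision with masks of different granularity, can be sketched as a product of masks. The function names, shapes, and hard 0/1 mask values below are illustrative only, not CoFi's actual implementation (which learns relaxed masks jointly with a distillation objective):

```python
import numpy as np

def effective_head_mask(z_layer, z_heads):
    """Combine a coarse-grained and a fine-grained pruning mask.

    z_layer: scalar in {0, 1} -- coarse mask that can drop the whole
             attention sublayer at once.
    z_heads: (num_heads,) array in {0, 1} -- fine mask over heads.
    A head's parameters survive only if every mask covering them is on,
    so the product couples pruning decisions across granularities.
    """
    return z_layer * np.asarray(z_heads)

def apply_masks(per_head_out, z_layer, z_heads):
    # per_head_out: (batch, seq, num_heads, head_dim) attention outputs.
    mask = effective_head_mask(z_layer, z_heads)
    return per_head_out * mask[None, None, :, None]
```

Because whole heads (or layers) are zeroed rather than scattered weights, the surviving subnetwork stays dense and highly parallelizable, which is where the latency gains come from.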
MABEL: Attenuating Gender Bias using Textual Entailment Data
Pre-trained language models encode undesirable social biases, which are
further exacerbated in downstream use. To this end, we propose MABEL (a Method
for Attenuating Gender Bias using Entailment Labels), an intermediate
pre-training approach for mitigating gender bias in contextualized
representations. Key to our approach is the use of a contrastive learning
objective on counterfactually augmented, gender-balanced entailment pairs from
natural language inference (NLI) datasets. We also introduce an alignment
regularizer that pulls identical entailment pairs along opposite gender
directions closer. We extensively evaluate our approach on intrinsic and
extrinsic metrics, and show that MABEL outperforms previous task-agnostic
debiasing approaches in terms of fairness. It also preserves task performance
after fine-tuning on downstream tasks. Together, these findings demonstrate the
suitability of NLI data as an effective means of bias mitigation, as opposed to
the unlabeled sentences used in prior work. Finally, we identify that
existing approaches often use evaluation settings that are insufficient or
inconsistent. We make an effort to reproduce and compare previous methods, and
call for unifying the evaluation settings across gender debiasing methods for
better future comparison.
Comment: Accepted to EMNLP 2022. Code and models are publicly available at
https://github.com/princeton-nlp/mabe
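The contrastive objective on entailment pairs can be illustrated with a generic in-batch InfoNCE loss. This is a simplified sketch, not MABEL's exact loss (which additionally includes the alignment regularizer described above); the shapes and temperature value are assumptions:

```python
import numpy as np

def contrastive_loss(premise_emb, hypothesis_emb, tau=0.05):
    """InfoNCE-style loss over a batch of entailment pairs.

    premise_emb, hypothesis_emb: (n, d) L2-normalized embeddings; row i
    of each matrix is one entailment pair, and the other rows in the
    batch serve as in-batch negatives.
    """
    sims = premise_emb @ hypothesis_emb.T / tau          # (n, n) similarities
    sims = sims - sims.max(axis=1, keepdims=True)        # numerical stability
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # pull pair i together
```

Training on counterfactually augmented, gender-balanced pairs means each premise appears with both gender variants, so the representation the loss pulls together cannot rely on gendered cues.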
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged
moderate-sized large language models (LLMs) highlights the potential of
building smaller yet powerful LLMs. Nevertheless, the cost of training such
models from scratch on trillions of tokens remains high. In this work, we study
structured pruning as an effective means to develop smaller LLMs from
pre-trained, larger models. Our approach employs two key techniques: (1)
targeted structured pruning, which prunes a larger model to a specified target
shape by removing layers, heads, and intermediate and hidden dimensions in an
end-to-end manner, and (2) dynamic batch loading, which dynamically updates the
composition of sampled data in each training batch based on varying losses
across different domains. We demonstrate the efficacy of our approach by
presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B
and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art
open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA
models, on a wide range of downstream and instruction tuning evaluations, while
requiring only 3% of compute compared to training such models from scratch.
This work provides compelling evidence that leveraging existing LLMs with
structured pruning is a far more cost-effective approach for building smaller
LLMs.
Comment: The code and models are available at
https://github.com/princeton-nlp/LLM-Shearin
Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models
Pre-trained masked language models successfully perform few-shot learning by
formulating downstream tasks as text infilling. However, as a strong
alternative in full-shot settings, discriminative pre-trained models like
ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based
few-shot learning to ELECTRA and show that it outperforms masked language
models in a wide range of tasks. ELECTRA is pre-trained to distinguish whether
a token is generated or original. We naturally extend that to prompt-based
few-shot learning by training to score the originality of the target options
without introducing new parameters. Our method can be easily adapted to tasks
involving multi-token predictions without extra computation overhead. Analysis
shows that ELECTRA learns distributions that align better with downstream
tasks.
Comment: Accepted to EMNLP 2022; The code is available at
https://github.com/facebookresearch/ELECTRA-Fewshot-Learnin
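Scoring the "originality" of target options can be sketched as follows: fill each candidate option into the prompt, read off the discriminator's per-token probability that each option token is original, and pick the highest-scoring option. The averaging-over-tokens choice and the function names are assumptions for illustration:

```python
def option_score(token_original_probs):
    """Score one candidate option by averaging the discriminator's
    per-token 'is original' probabilities for its tokens. Averaging
    handles multi-token options without new parameters."""
    return sum(token_original_probs) / len(token_original_probs)

def pick_option(per_option_token_probs):
    # per_option_token_probs[i] holds the token-level originality
    # probabilities for the prompt with option i filled in; choose the
    # option the discriminator finds most 'original'.
    scores = [option_score(p) for p in per_option_token_probs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Since the pre-trained discriminator head already outputs these probabilities, no new parameters are introduced, which is what makes the few-shot setting tractable.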
Training Trajectories of Language Models Across Scales
Scaling up language models has led to unprecedented performance gains, but
little is understood about how the training dynamics change as models get
larger. How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors? In this
paper, we analyze the intermediate training checkpoints of differently sized
OPT models (Zhang et al., 2022), ranging from 125M to 175B parameters, on next-token
prediction, sequence-level generation, and downstream tasks. We find that 1) at
a given perplexity and independent of model size, a similar subset of training
tokens sees the most significant reduction in loss, with the rest stagnating or
showing double-descent behavior; 2) early in training, all models learn to
reduce the perplexity of grammatical sequences that contain hallucinations,
with small models halting at this suboptimal distribution and larger ones
eventually learning to assign these sequences lower probabilities; 3)
perplexity is a strong predictor of in-context learning performance on 74
multiple-choice tasks from BIG-Bench, and this holds independent of the model
size. Together, these results show that perplexity is more predictive of model
behaviors than model size or training computation.
Comment: Accepted to ACL 2023; The code and analysis results are available at
https://github.com/xiamengzhou/training_trajectory_analysi
Efficacy of Minocycline in Acute Ischemic Stroke: A Systematic Review and Meta-Analysis of Rodent and Clinical Studies
Objectives: This study aimed to assess the efficacy of minocycline for the treatment of acute ischemic stroke.

Background: While there have been meta-analyses surveying the efficacy of minocycline in the treatment of acute stroke, they have methodological limitations. We performed a new systematic review, distinct from previous ones, by adding new outcomes and including new studies.

Methods: Document retrieval was executed through PubMed, the Cochrane Central Register of Controlled Trials, the Stroke Center, NIH's Clinical Trials, Current Controlled Trials, and the WHO International Clinical Trials Registry Platform Search Portal before January 2018. Data meeting the inclusion criteria were extracted. Before meta-analysis, publication bias and heterogeneity of the included studies were assessed. Random- and fixed-effects models were employed to calculate pooled estimates and 95% confidence intervals (CIs). Additionally, sensitivity and subgroup analyses were performed.

Results: For clinical studies, 4 trials with 201 patients in the minocycline group and 195 patients in the control group met the inclusion criteria; 3 were randomized trials. At the end of 90-day follow-up or on the day of discharge, the groups receiving minocycline were superior to the control group, with significant differences in NIHSS scores (mean difference [MD], −2.75; 95% CI, −4.78, 0.27; p = 0.03) and mRS scores (MD, −0.98; 95% CI, −1.27, −0.69; p < 0.01), but not in Barthel Index scores (MD, 9.04; 95% CI, −0.78, 18.07; p = 0.07). For rodent experiments, 14 studies were included. Neurological severity scores (NSS) were significantly improved (MD, −1.38; 95% CI, −1.64, −1.31; p < 0.01) and infarct volume was markedly reduced (standardized mean difference [SMD], −2.38; 95% CI, −3.40, −1.36; p < 0.01) in the minocycline group. Heterogeneity among the studies was shown to exist for infarct volume (Chi² = 116.12, p < 0.01; I² = 0.89) but not for the other variables.

Conclusions: Based on these results, minocycline appears to be an effective therapeutic option for acute ischemic stroke.
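The pooled estimates and 95% CIs reported above come from standard inverse-variance meta-analysis. A minimal sketch of the fixed-effect version (the study also used random-effects models, which additionally incorporate between-study variance) looks like this; the function name and toy inputs are illustrative:

```python
import math

def fixed_effect_pool(effects, std_errors):
    """Inverse-variance (fixed-effect) pooling of per-study mean
    differences: studies with smaller standard errors receive larger
    weights. Returns the pooled estimate and its 95% CI."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
```

A pooled CI that excludes zero, as for the mRS and NSS outcomes above, is what corresponds to a statistically significant treatment effect at the 5% level.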