
    Structured Pruning Learns Compact and Accurate Models

    The growing size of neural language models has led to increased attention to model compression. The two predominant approaches are pruning, which gradually removes weights from a pre-trained model, and distillation, which trains a smaller compact model to match a larger one. Pruning methods can significantly reduce model size but rarely achieve speedups as large as distillation; distillation methods, however, require large amounts of unlabeled data and are expensive to train. In this work, we propose a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning), which delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency, without resorting to any unlabeled data. Our key insight is to jointly prune coarse-grained modules (e.g., layers) and fine-grained modules (e.g., heads and hidden units), controlling the pruning decision of each parameter with masks of different granularity. We also devise a layerwise distillation strategy to transfer knowledge from the unpruned to the pruned model during optimization. Our experiments on GLUE and SQuAD show that CoFi yields models with over 10x speedups and a small accuracy drop, demonstrating its effectiveness and efficiency compared to previous pruning and distillation approaches.
    Comment: Accepted to ACL 2022; The code and models are available at https://github.com/princeton-nlp/CoFiPrunin
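    The abstract describes gating modules at two granularities: coarse masks over whole layers and fine masks over attention heads and hidden units. Below is a minimal, hypothetical PyTorch sketch of that idea (not the CoFi implementation; the class name and residual wiring are assumptions, and the actual method learns the masks with a sparsity objective rather than leaving them as free parameters):

        # Illustrative sketch of coarse- and fine-grained masking on one
        # attention sublayer; not the released CoFi code.
        import torch
        import torch.nn as nn

        class MaskedAttentionLayer(nn.Module):
            def __init__(self, hidden_size=768, num_heads=12):
                super().__init__()
                self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
                self.num_heads = num_heads
                self.head_dim = hidden_size // num_heads
                # Fine-grained gates: one per attention head, one per hidden unit.
                self.head_mask = nn.Parameter(torch.ones(num_heads))
                self.hidden_mask = nn.Parameter(torch.ones(hidden_size))
                # Coarse-grained gate: one for the entire sublayer.
                self.layer_mask = nn.Parameter(torch.ones(1))

            def forward(self, x):
                out, _ = self.attn(x, x, x)
                b, t, h = out.shape
                # Apply per-head gates by viewing the output as separate heads.
                out = out.view(b, t, self.num_heads, self.head_dim)
                out = out * self.head_mask.view(1, 1, -1, 1)
                out = out.view(b, t, h)
                # Per-hidden-unit gates, then the layer gate; a zero layer gate
                # drops the whole sublayer while the residual path survives.
                out = out * self.hidden_mask * self.layer_mask
                return x + out

        layer = MaskedAttentionLayer()
        x = torch.randn(2, 16, 768)
        print(layer(x).shape)  # torch.Size([2, 16, 768])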

    MABEL: Attenuating Gender Bias using Textual Entailment Data

    Pre-trained language models encode undesirable social biases, which are further exacerbated in downstream use. To this end, we propose MABEL (a Method for Attenuating Gender Bias using Entailment Labels), an intermediate pre-training approach for mitigating gender bias in contextualized representations. Key to our approach is a contrastive learning objective on counterfactually augmented, gender-balanced entailment pairs from natural language inference (NLI) datasets. We also introduce an alignment regularizer that pulls identical entailment pairs along opposite gender directions closer together. We evaluate our approach extensively on intrinsic and extrinsic metrics and show that MABEL outperforms previous task-agnostic debiasing approaches in terms of fairness, while preserving task performance after fine-tuning on downstream tasks. Together, these findings demonstrate the suitability of NLI data as an effective means of bias mitigation, as opposed to the unlabeled sentences used in prior work. Finally, we identify that existing approaches often use evaluation settings that are insufficient or inconsistent; we make an effort to reproduce and compare previous methods, and call for unifying the evaluation settings across gender debiasing methods to enable better future comparisons.
    Comment: Accepted to EMNLP 2022. Code and models are publicly available at https://github.com/princeton-nlp/mabe
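    The method combines a contrastive objective over entailment pairs with an alignment term tying each pair to its gender-counterfactual version. The following is a rough, hypothetical sketch of what such a combined loss can look like (illustrative function names and weighting; the paper's exact objective differs):

        # Illustrative contrastive + alignment loss over gender-counterfactual
        # entailment pairs; not the MABEL implementation.
        import torch
        import torch.nn.functional as F

        def contrastive_with_alignment(prem, hyp, prem_cf, hyp_cf,
                                       temperature=0.05, lam=1.0):
            """prem/hyp: [batch, dim] embeddings of premise/hypothesis pairs;
            prem_cf/hyp_cf: embeddings of their gender-swapped counterparts."""
            # In-batch InfoNCE: each premise should match its own hypothesis.
            prem_n = F.normalize(prem, dim=-1)
            hyp_n = F.normalize(hyp, dim=-1)
            logits = prem_n @ hyp_n.t() / temperature
            labels = torch.arange(prem.size(0), device=prem.device)
            nce = F.cross_entropy(logits, labels)
            # Alignment: pull each pair toward its counterfactual twin.
            align = F.mse_loss(prem, prem_cf) + F.mse_loss(hyp, hyp_cf)
            return nce + lam * align

        # Toy usage with random stand-ins for encoder outputs.
        b, d = 8, 768
        loss = contrastive_with_alignment(torch.randn(b, d), torch.randn(b, d),
                                          torch.randn(b, d), torch.randn(b, d))
        print(loss.item())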

    Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

    The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Nevertheless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of the compute needed to train such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach to building smaller LLMs.
    Comment: The code and models are available at https://github.com/princeton-nlp/LLM-Shearin
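    Dynamic batch loading, as described, re-weights how much each pre-training domain contributes to a batch based on how far that domain's current loss is from a reference loss. A small, hypothetical sketch of one such multiplicative update (domain names, losses, and the step size are illustrative; this is not the released LLM-Shearing code):

        # Illustrative loss-gap-driven re-weighting of domain sampling
        # proportions; not the released LLM-Shearing implementation.
        import numpy as np

        def update_domain_weights(current_loss, reference_loss, prev_weights,
                                  step_size=1.0):
            """All arguments are 1-D arrays indexed by domain."""
            # Domains lagging their reference loss get up-weighted via an
            # exponentiated-gradient-style multiplicative update.
            gap = np.maximum(current_loss - reference_loss, 0.0)
            logits = np.log(prev_weights + 1e-12) + step_size * gap
            weights = np.exp(logits - logits.max())
            return weights / weights.sum()

        domains = ["CommonCrawl", "C4", "GitHub", "Wikipedia",
                   "Books", "ArXiv", "StackExchange"]
        w = np.ones(len(domains)) / len(domains)
        cur = np.array([2.1, 2.0, 1.2, 1.8, 2.3, 1.5, 1.9])   # illustrative
        ref = np.array([1.9, 1.9, 1.1, 1.9, 2.0, 1.4, 1.8])   # illustrative
        w = update_domain_weights(cur, ref, w)
        print(dict(zip(domains, np.round(w, 3))))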

    Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

    Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, discriminative pre-trained models such as ELECTRA, a strong alternative in full-shot settings, do not fit into this paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models on a wide range of tasks. ELECTRA is pre-trained to distinguish whether a token is generated or original. We naturally extend this to prompt-based few-shot learning by training the model to score the originality of the target options without introducing new parameters. Our method can be easily adapted to tasks involving multi-token predictions without extra computational overhead. Analysis shows that ELECTRA learns distributions that align better with downstream tasks.
    Comment: Accepted to EMNLP 2022; The code is available at https://github.com/facebookresearch/ELECTRA-Fewshot-Learnin
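    The key move is to fill a prompt with each candidate option and ask the ELECTRA discriminator how "original" the option tokens look. A hypothetical zero-shot-style sketch using the Hugging Face transformers discriminator head (template, scoring rule, and model choice are assumptions; the paper's exact formulation may differ):

        # Illustrative option scoring with an ELECTRA discriminator;
        # not the ELECTRA-Fewshot-Learning code.
        import torch
        from transformers import ElectraForPreTraining, ElectraTokenizerFast

        name = "google/electra-small-discriminator"
        tok = ElectraTokenizerFast.from_pretrained(name)
        model = ElectraForPreTraining.from_pretrained(name).eval()

        def originality_score(template, option):
            # Assumes the option is placed at the very end of the prompt.
            enc = tok(template.format(option), return_tensors="pt")
            with torch.no_grad():
                logits = model(**enc).logits[0]      # per-token "replaced" logits
            n = len(tok(option, add_special_tokens=False)["input_ids"])
            probs = torch.sigmoid(logits[1:-1])      # drop [CLS] and [SEP]
            # Lower "replaced" probability on the option tokens = more original.
            return -probs[-n:].mean().item()

        template = "The movie was boring and far too long. It was {}"
        options = ["great", "terrible"]
        print(max(options, key=lambda o: originality_score(template, o)))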

    Training Trajectories of Language Models Across Scales

    Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that (1) at a given perplexity, and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; (2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; and (3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
    Comment: Accepted to ACL 2023; The code and analysis results are available at https://github.com/xiamengzhou/training_trajectory_analysi
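    The central quantity throughout is validation perplexity, the exponentiated mean next-token loss of a checkpoint, compared against downstream accuracy. A hypothetical sketch of how such an analysis can be set up (the OPT checkpoint name is real, but the perplexity and accuracy numbers below are made up for illustration):

        # Illustrative perplexity computation and rank correlation with task
        # accuracy across checkpoints; numbers are made up, not paper results.
        import math
        import torch
        from scipy.stats import spearmanr
        from transformers import AutoModelForCausalLM, AutoTokenizer

        def perplexity(model_name, text):
            tok = AutoTokenizer.from_pretrained(model_name)
            model = AutoModelForCausalLM.from_pretrained(model_name).eval()
            enc = tok(text, return_tensors="pt")
            with torch.no_grad():
                # Mean next-token negative log-likelihood over the sequence.
                loss = model(**enc, labels=enc["input_ids"]).loss
            return math.exp(loss.item())

        # e.g. perplexity("facebook/opt-125m", "The capital of France is Paris.")

        # Given per-checkpoint perplexities and task accuracies, the claim is
        # that the rank correlation is strong regardless of model size.
        ppl = [28.1, 22.4, 18.9, 16.2, 15.0]   # illustrative
        acc = [0.31, 0.36, 0.42, 0.47, 0.49]   # illustrative
        print(spearmanr(ppl, acc))             # strong negative correlation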

    Efficacy of Minocycline in Acute Ischemic Stroke: A Systematic Review and Meta-Analysis of Rodent and Clinical Studies

    Objectives: This study aimed to assess the efficacy of minocycline for the treatment of acute ischemic stroke.
    Background: While previous meta-analyses have surveyed the efficacy of minocycline in the treatment of acute stroke, they have methodological limitations. We performed a new systematic review, distinct from previous ones, that adds new outcomes and includes new studies.
    Methods: Literature searches were performed in PubMed, the Cochrane Central Register of Controlled Trials, the Stroke Center, NIH's Clinical Trials, Current Controlled Trials, and the WHO International Clinical Trials Registry Platform Search Portal before January 2018. Data meeting the inclusion criteria were extracted. Before meta-analysis, publication bias and heterogeneity of the included studies were assessed. Random- and fixed-effects models were employed to calculate pooled estimates and 95% confidence intervals (CIs). Additionally, sensitivity and subgroup analyses were performed.
    Results: For clinical studies, 4 trials, with 201 patients in the minocycline group and 195 patients in the control group, met the inclusion criteria; 3 were randomized trials. At the end of 90-day follow-up or at discharge, the groups receiving minocycline were superior to the control group, with significant differences in NIHSS scores (mean difference [MD], −2.75; 95% CI, −4.78, 0.27; p = 0.03) and mRS scores (MD, −0.98; 95% CI, −1.27, −0.69; p < 0.01), but not in the Barthel Index score (MD, 9.04; 95% CI, −0.78, 18.07; p = 0.07). For rodent experiments, 14 studies were included. Neurological severity scores (NSS) were significantly improved (MD, −1.38; 95% CI, −1.64, −1.31; p < 0.01) and infarct volume was markedly reduced (standardized mean difference [SMD], −2.38; 95% CI, −3.40, −1.36; p < 0.01) in the minocycline group. Heterogeneity among the studies was present for infarct volume (Chi2 = 116.12, p < 0.01; I2 = 0.89) but not for the other outcomes.
    Conclusions: Based on these results, minocycline appears to be an effective therapeutic option for acute ischemic stroke.
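    The pooled estimates above come from standard inverse-variance meta-analysis, with a fixed-effect model and a random-effects model that adds a between-study variance term. A small, hypothetical sketch of that computation (DerSimonian-Laird estimator; the per-study numbers are made up for illustration and are not data from the reviewed trials):

        # Illustrative fixed- and random-effects pooling of mean differences;
        # the inputs below are made-up numbers, not the reviewed studies' data.
        import numpy as np

        def pool(effects, ses):
            effects = np.asarray(effects, float)
            ses = np.asarray(ses, float)
            w = 1.0 / ses**2                               # fixed-effect weights
            fixed = np.sum(w * effects) / np.sum(w)
            q = np.sum(w * (effects - fixed) ** 2)         # Cochran's Q
            df = len(effects) - 1
            i2 = max(0.0, (q - df) / q) if q > 0 else 0.0  # I^2 heterogeneity
            # DerSimonian-Laird between-study variance.
            tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
            wr = 1.0 / (ses**2 + tau2)                     # random-effects weights
            pooled = np.sum(wr * effects) / np.sum(wr)
            se = np.sqrt(1.0 / np.sum(wr))
            ci = (pooled - 1.96 * se, pooled + 1.96 * se)
            return fixed, pooled, ci, i2

        fixed, pooled, ci, i2 = pool([-2.1, -3.0, -1.5, -2.8], [0.9, 1.1, 0.8, 1.2])
        print(f"fixed={fixed:.2f} random={pooled:.2f} "
              f"95% CI=({ci[0]:.2f}, {ci[1]:.2f}) I2={i2:.2f}")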