Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models
Pre-trained masked language models successfully perform few-shot learning by
formulating downstream tasks as text infilling. However, as a strong
alternative in full-shot settings, discriminative pre-trained models like
ELECTRA do not fit into this paradigm. In this work, we adapt prompt-based
few-shot learning to ELECTRA and show that it outperforms masked language
models in a wide range of tasks. ELECTRA is pre-trained to distinguish whether a
token is generated or original. We naturally extend this objective to prompt-based
few-shot learning by training to score the originality of the target options
without introducing new parameters. Our method can be easily adapted to tasks
involving multi-token predictions without extra computational overhead. Analysis
shows that ELECTRA learns distributions that align better with downstream
tasks.
Comment: Accepted to EMNLP 2022; the code is available at
https://github.com/facebookresearch/ELECTRA-Fewshot-Learnin
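The discriminator-as-scorer idea lends itself to a short sketch: fill the prompt with each candidate label word, run the ELECTRA discriminator, and average its "original" scores over the option's tokens. Below is a minimal sketch in Python, assuming the public google/electra-base-discriminator checkpoint from HuggingFace transformers; the template ("It was <option>.") and the label words are illustrative choices, not the paper's exact setup.

```python
# Sketch: prompt-based classification with the ELECTRA discriminator.
# Assumes the public google/electra-base-discriminator checkpoint; the
# template and the label words below are illustrative.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-base-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name).eval()

def originality_score(text: str, option: str) -> float:
    """Average 'original' (not-replaced) probability over the option's tokens."""
    prompt = f"{text} It was {option}."
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]            # >0 means 'replaced/generated'
    ids = enc["input_ids"][0].tolist()
    opt_ids = tokenizer(option, add_special_tokens=False)["input_ids"]
    # locate the option's token span inside the prompt (first match)
    for i in range(len(ids) - len(opt_ids) + 1):
        if ids[i:i + len(opt_ids)] == opt_ids:
            return torch.sigmoid(-logits[i:i + len(opt_ids)]).mean().item()
    raise ValueError("option tokens not found in prompt")

review = "The plot is clever and the acting is superb."
scores = {w: originality_score(review, w) for w in ["great", "terrible"]}
print(max(scores, key=scores.get))                 # predicted label word
```

Because the score is an average over however many tokens the option spans, multi-token options need no extra machinery, which matches the abstract's claim about multi-token predictions.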
Complementary Explanations for Effective In-Context Learning
Large language models (LLMs) have exhibited remarkable capabilities in
learning from explanations in prompts. Yet, there has been limited
understanding of what makes explanations effective for in-context learning.
This work aims to better understand the mechanisms by which explanations are
used for in-context learning. We first study the impact of two different
factors on prompting performance when using explanations: the computation trace
(the way the solution is decomposed) and the natural language used to express the prompt. By
perturbing explanations on three controlled tasks, we show that both factors
contribute to the effectiveness of explanations, indicating that LLMs do
faithfully follow the explanations to some extent. We further study how to form
maximally effective sets of explanations for solving a given test query. We
find that LLMs can benefit from the complementarity of the explanation set as
they are able to fuse the different lines of reasoning specified by individual exemplars in
prompts. Additionally, having relevant exemplars also contributes to more
effective prompts. Therefore, we propose a maximal-marginal-relevance-based
exemplar selection approach for constructing exemplar sets that are both
relevant and complementary, which successfully improves in-context
learning performance across three real-world tasks on multiple LLMs.
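The selection step follows the classic maximal-marginal-relevance template: greedily pick the exemplar most relevant to the test query while penalizing redundancy with exemplars already chosen. A minimal sketch, assuming precomputed embeddings; cosine similarity and the lam trade-off weight are illustrative stand-ins for whatever relevance and complementarity measures the method actually uses.

```python
# Sketch: MMR-style exemplar selection over precomputed embeddings.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_emb, exemplar_embs, k=4, lam=0.5):
    """Pick k exemplars that are relevant to the query yet mutually diverse."""
    selected, remaining = [], list(range(len(exemplar_embs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_emb, exemplar_embs[i])
            redundancy = max((cosine(exemplar_embs[i], exemplar_embs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of the exemplars to place in the prompt
```

Setting lam closer to 1 favors relevance alone; lowering it trades relevance for the complementarity the abstract argues LLMs can exploit.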
Training Trajectories of Language Models Across Scales
Scaling up language models has led to unprecedented performance gains, but
little is understood about how the training dynamics change as models get
larger. How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors? In this
paper, we analyze the intermediate training checkpoints of differently sized
OPT models (Zhang et al., 2022), ranging from 125M to 175B parameters, on next-token
prediction, sequence-level generation, and downstream tasks. We find that 1) at
a given perplexity, and independent of model size, a similar subset of training
tokens sees the most significant reduction in loss, with the rest stagnating or
showing double-descent behavior; 2) early in training, all models learn to
reduce the perplexity of grammatical sequences that contain hallucinations,
with small models halting at this suboptimal distribution and larger ones
eventually learning to assign these sequences lower probabilities; 3)
perplexity is a strong predictor of in-context learning performance on 74
multiple-choice tasks from BIG-Bench, and this holds independent of the model
size. Together, these results show that perplexity is more predictive of model
behaviors than model size or training computation.
Comment: Accepted to ACL 2023; the code and analysis results are available at
https://github.com/xiamengzhou/training_trajectory_analysi
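Finding 1) rests on comparing per-token losses across checkpoints. Below is a minimal sketch of that measurement, using two public OPT sizes as stand-ins for two intermediate checkpoints of one model (OPT models share a tokenizer, so per-token losses align position by position); the probe sentence is illustrative.

```python
# Sketch: per-token loss reduction between two models on the same text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_nll(model_name: str, text: str) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # predict next token
    targets = enc["input_ids"][0, 1:]
    return -logprobs[torch.arange(len(targets)), targets]

text = "The capital of France is Paris."
early = per_token_nll("facebook/opt-125m", text)   # stand-in: earlier/smaller
late = per_token_nll("facebook/opt-1.3b", text)    # stand-in: later/larger
print(early - late)   # per-token loss reduction; largest entries improved most
```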
LEVER: Learning to Verify Language-to-Code Generation with Execution
The advent of pre-trained code language models (CodeLMs) has led to
significant progress in language-to-code generation. State-of-the-art
approaches in this area combine CodeLM decoding with sample pruning and
reranking using test cases or heuristics based on the execution results.
However, it is challenging to obtain test cases for many real-world
language-to-code applications, and heuristics cannot fully capture the semantic
features of the execution results, such as data type and value range, which
often indicate the correctness of the program. In this work, we propose LEVER,
a simple approach to improve language-to-code generation by learning to verify
the generated programs with their execution results. Specifically, we train
verifiers to determine whether a program sampled from the CodeLM is correct or
not based on the natural language input, the program itself, and its execution
results. The sampled programs are reranked by combining the verification score
with the CodeLM generation probability, and marginalizing over programs with
the same execution results. On four datasets across the domains of table QA,
math QA and basic Python programming, LEVER consistently improves over the base
CodeLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art
results on all of them.
Comment: 23 pages
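The reranking step reduces to a compact computation: weight each sampled program by the product of its CodeLM probability and its verifier score, pool the mass of programs that execute to the same result, and return the top program from the winning pool. In the sketch below, lm_logprob, verifier_prob, and execute are hypothetical placeholders for the CodeLM, the trained verifier, and a sandboxed executor; execution results are assumed hashable (e.g., strings).

```python
# Sketch: LEVER-style verification-and-reranking over sampled programs.
# lm_logprob, verifier_prob, and execute are hypothetical placeholders.
import math
from collections import defaultdict

def rerank(question, programs, lm_logprob, verifier_prob, execute):
    """Return the program whose execution result accumulates the most
    P_LM(program) * P_verifier(correct) mass across all samples."""
    by_result = defaultdict(list)
    for prog in programs:
        result = execute(prog)                     # assumed hashable
        joint = math.exp(lm_logprob(question, prog)) \
                * verifier_prob(question, prog, result)
        by_result[result].append((joint, prog))
    # marginalize over programs sharing an execution result
    best = max(by_result, key=lambda r: sum(s for s, _ in by_result[r]))
    return max(by_result[best], key=lambda t: t[0])[1]
```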
bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
We present bgGLUE (Bulgarian General Language Understanding Evaluation), a
benchmark for evaluating language models on Natural Language Understanding
(NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety
of NLP problems (e.g., natural language inference, fact-checking, named entity
recognition, sentiment analysis, and question answering) and machine learning
tasks (sequence labeling, document-level classification, and regression). We
run the first systematic evaluation of pre-trained language models for
Bulgarian, comparing and contrasting results across the nine tasks in the
benchmark. The evaluation results show strong performance on sequence labeling
tasks, but there is a lot of room for improvement for tasks that require more
complex reasoning. We make bgGLUE publicly available together with the
fine-tuning and the evaluation code, as well as a public leaderboard at
https://bgglue.github.io/, and we hope that it will enable further advancements
in developing NLU models for Bulgarian.
Comment: Accepted to ACL 2023 (Main Conference)