Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE)
have proven effective in scaling up Transformer model size for
\textit{pretraining} large language models. By activating only part of the FFN
parameters, conditioned on the input, S-FFN improves generalization performance
while keeping training and inference costs (in FLOPs) fixed. In this work, we
analyzed two major design choices of S-FFN: the memory block (a.k.a. expert)
size and the memory block selection method under a general conceptual framework
of sparse neural memory. Using this unified framework, we compare several S-FFN
architectures for language modeling and provide insights into their relative
efficacy and efficiency. We found that a simpler selection method,
\textbf{\texttt{Avg-K}}, which selects blocks through their mean aggregated
hidden states, achieves lower perplexity in language model pretraining than
existing MoE architectures, including Switch Transformer (Fedus et al., 2021)
and HashLayer (Roller et al., 2021).
Comment: Accepted to EMNLP 2023
Exploring the Relationship Among International Students' English Self-efficacy, Using English to Learn Self-efficacy, and Academic Self-efficacy
One of the major challenges for international students pursuing academic goals in the United States is English language proficiency, which often negatively affects academic success. Even students who are confident in their English language proficiency encounter challenges using English in class. Previous research indicates that self-efficacy positively predicts English language proficiency and academic achievement. The current study therefore hypothesized a model with self-efficacy in using English to learn as a mediator between English self-efficacy and academic self-efficacy. The structural equation modeling results indicate that English self-efficacy indirectly influenced international students’ academic self-efficacy through their using-English-to-learn self-efficacy. Findings suggest that self-efficacy in using English and self-efficacy in using English to learn are two distinct constructs. These results warrant academic English support for non-native English-speaking international students.
Reimagining Retrieval Augmented Language Models for Answering Queries
We present a reality check on large language models and inspect the promise
of retrieval augmented language models in comparison. Such language models are
semi-parametric, where models integrate model parameters and knowledge from
external data sources to make their predictions, as opposed to the parametric
nature of vanilla large language models. We give initial experimental findings
that semi-parametric architectures can be enhanced with views, a query
analyzer/planner, and provenance to make a significantly more powerful system
for question answering in terms of accuracy and efficiency, and potentially for
other NLP tasks.
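To make the proposed components more concrete, here is a minimal sketch of a semi-parametric question-answering loop of this flavor: a query analyzer/planner decomposes the question, evidence is retrieved from an external data source, and the answer carries provenance. The retriever and llm callables, the prompts, and the field names are hypothetical illustrations, not the system described in the paper.

    from dataclasses import dataclass

    @dataclass
    class Answer:
        text: str
        provenance: list[str]  # identifiers of the documents backing the answer

    def answer_query(query, retriever, llm):
        # Query analyzer/planner: let the LM break the question into retrieval steps.
        sub_queries = llm(f"Decompose into retrieval steps:\n{query}").splitlines()
        evidence = []
        for sq in sub_queries:
            evidence.extend(retriever(sq, k=3))   # top-3 passages per planned step
        context = "\n".join(doc["text"] for doc in evidence)
        text = llm(f"Answer using only this context.\n{context}\nQuestion: {query}")
        # Provenance: record which external documents informed the answer.
        return Answer(text=text, provenance=[doc["id"] for doc in evidence])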
LEVER: Learning to Verify Language-to-Code Generation with Execution
The advent of pre-trained code language models (CodeLMs) has led to
significant progress in language-to-code generation. State-of-the-art
approaches in this area combine CodeLM decoding with sample pruning and
reranking using test cases or heuristics based on the execution results.
However, it is challenging to obtain test cases for many real-world
language-to-code applications, and heuristics cannot adequately capture the
semantic features of the execution results, such as data type and value range,
which often indicate the correctness of the program. In this work, we propose LEVER,
a simple approach to improve language-to-code generation by learning to verify
the generated programs with their execution results. Specifically, we train
verifiers to determine whether a program sampled from the CodeLM is correct or
not based on the natural language input, the program itself and its execution
results. The sampled programs are reranked by combining the verification score
with the CodeLM generation probability, and marginalizing over programs with
the same execution results. On four datasets across the domains of table QA,
math QA and basic Python programming, LEVER consistently improves over the base
CodeLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art
results on all of them.
Comment: 23 pages
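As a concrete illustration of the reranking rule described above, the sketch below combines each sampled program's generation probability with a learned verification score and marginalizes over programs that share an execution result. The sample fields and the verifier callable are assumptions for illustration, not LEVER's actual interface.

    import math
    from collections import defaultdict

    def rerank(samples, verifier):
        """samples: dicts with keys nl_input, program, logprob, exec_result
        (assumed field names); verifier(nl, program, result) -> P(correct)."""
        result_scores = defaultdict(float)
        for s in samples:
            p_gen = math.exp(s["logprob"])        # CodeLM generation probability
            p_ver = verifier(s["nl_input"], s["program"], s["exec_result"])
            # Marginalize: programs with the same execution result pool their scores.
            result_scores[s["exec_result"]] += p_gen * p_ver
        best_result = max(result_scores, key=result_scores.get)
        # Return any sampled program whose execution yields the winning result.
        return next(s["program"] for s in samples if s["exec_result"] == best_result)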
Training Trajectories of Language Models Across Scales
Scaling up language models has led to unprecedented performance gains, but
little is understood about how the training dynamics change as models get
larger. How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors? In this
paper, we analyze the intermediate training checkpoints of differently sized
OPT models (Zhang et al., 2022) -- from 125M to 175B parameters -- on next-token
prediction, sequence-level generation, and downstream tasks. We find that 1) at
a given perplexity, and independent of model size, a similar subset of training
tokens sees the most significant reduction in loss, with the rest stagnating or
showing double-descent behavior; 2) early in training, all models learn to
reduce the perplexity of grammatical sequences that contain hallucinations,
with small models halting at this suboptimal distribution and larger ones
eventually learning to assign these sequences lower probabilities; 3)
perplexity is a strong predictor of in-context learning performance on 74
multiple-choice tasks from BIG-Bench, and this holds independent of the model
size. Together, these results show that perplexity is more predictive of model
behaviors than model size or training computation.
Comment: Accepted to ACL 2023; The code and analysis results are available at
https://github.com/xiamengzhou/training_trajectory_analysi
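Finding (1) rests on measuring per-token next-token loss at intermediate checkpoints; the sketch below shows that measurement for a single checkpoint, assuming a Hugging Face-style causal LM (e.g., an OPT checkpoint) whose forward pass returns logits of shape (batch, seq, vocab). Checkpoint loading and the cross-checkpoint comparison loop are left as assumed context.

    import torch
    import torch.nn.functional as F

    def per_token_loss(model, input_ids):
        """Per-token next-token loss for one checkpoint (shape: batch x seq-1)."""
        with torch.no_grad():
            logits = model(input_ids).logits[:, :-1]   # predict token t+1 from token t
        targets = input_ids[:, 1:]
        losses = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        )
        return losses.view(targets.shape)

    # Comparing two checkpoints token by token (checkpoint loading assumed):
    # delta = per_token_loss(ckpt_early, ids) - per_token_loss(ckpt_late, ids)
    # Large positive entries mark the tokens whose loss drops most during training.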