Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE)
have proven effective in scaling up Transformer model size for
\textit{pretraining} large language models. By activating only part of the FFN
parameters, conditioned on the input, S-FFN improves generalization performance
while keeping training and inference costs (in FLOPs) fixed. In this work, we
analyzed two major design choices of S-FFN: the memory block (a.k.a. expert)
size and the memory block selection method under a general conceptual framework
of sparse neural memory. Using this unified framework, we compare several S-FFN
architectures for language modeling and provide insights into their relative
efficacy and efficiency. We found that a simpler selection method,
\textbf{\texttt{Avg-K}}, which selects blocks through their mean aggregated
hidden states, achieves lower perplexity in language model pretraining than
existing MoE architectures, including Switch Transformer (Fedus et al., 2021)
and HashLayer (Roller et al., 2021).
Comment: Accepted to EMNLP 2023
Exploring the Relationship Among International Students' English Self-efficacy, Using English to Learn Self-efficacy, and Academic Self-efficacy
One of the major challenges for international students pursuing academic goals in the United States is English language proficiency, which often negatively affects academic success. Even students who are confident in their English language proficiency encounter challenges using English in class. Previous research indicates that self-efficacy positively predicts English language proficiency and academic achievement. The current study therefore hypothesized a model with self-efficacy in using English to learn as a mediator between English self-efficacy and academic self-efficacy. The structural equation modeling results indicate that English self-efficacy indirectly influenced international students’ academic self-efficacy through their using-English-to-learn self-efficacy. Findings suggest that self-efficacy in using English and self-efficacy in using English to learn are two distinct constructs. These results warrant academic English support for non-native English-speaking international students.
Reimagining Retrieval Augmented Language Models for Answering Queries
We present a reality check on large language models and inspect the promise
of retrieval augmented language models in comparison. Such language models are
semi-parametric, where models integrate model parameters and knowledge from
external data sources to make their predictions, as opposed to the parametric
nature of vanilla large language models. We give initial experimental findings
that semi-parametric architectures can be enhanced with views, a query
analyzer/planner, and provenance to make a significantly more powerful system
for question answering in terms of accuracy and efficiency, and potentially for
other NLP tasks.
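To make the proposed components more concrete, here is a minimal sketch of a semi-parametric question-answering loop of this flavor: a query analyzer/planner decomposes the question, evidence is retrieved from an external data source, and the answer carries provenance. The retriever and llm callables, the prompts, and the field names are hypothetical illustrations, not the system described in the paper.

    from dataclasses import dataclass

    @dataclass
    class Answer:
        text: str
        provenance: list[str]  # identifiers of the documents backing the answer

    def answer_query(query, retriever, llm):
        # Query analyzer/planner: let the LM break the question into retrieval steps.
        sub_queries = llm(f"Decompose into retrieval steps:\n{query}").splitlines()
        evidence = []
        for sq in sub_queries:
            evidence.extend(retriever(sq, k=3))   # top-3 passages per planned step
        context = "\n".join(doc["text"] for doc in evidence)
        text = llm(f"Answer using only this context.\n{context}\nQuestion: {query}")
        # Provenance: record which external documents informed the answer.
        return Answer(text=text, provenance=[doc["id"] for doc in evidence])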
LEVER: Learning to Verify Language-to-Code Generation with Execution
The advent of pre-trained code language models (CodeLMs) has led to
significant progress in language-to-code generation. State-of-the-art
approaches in this area combine CodeLM decoding with sample pruning and
reranking using test cases or heuristics based on the execution results.
However, it is challenging to obtain test cases for many real-world
language-to-code applications, and heuristics cannot adequately capture the
semantic features of the execution results, such as data type and value range,
which often indicate the correctness of the program. In this work, we propose LEVER,
a simple approach to improve language-to-code generation by learning to verify
the generated programs with their execution results. Specifically, we train
verifiers to determine whether a program sampled from the CodeLM is correct or
not based on the natural language input, the program itself and its execution
results. The sampled programs are reranked by combining the verification score
with the CodeLM generation probability, and marginalizing over programs with
the same execution results. On four datasets across the domains of table QA,
math QA and basic Python programming, LEVER consistently improves over the base
CodeLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art
results on all of them.
Comment: 23 pages
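As a concrete illustration of the reranking rule described above, the sketch below combines each sampled program's generation probability with a learned verification score and marginalizes over programs that share an execution result. The sample fields and the verifier callable are assumptions for illustration, not LEVER's actual interface.

    import math
    from collections import defaultdict

    def rerank(samples, verifier):
        """samples: dicts with keys nl_input, program, logprob, exec_result
        (assumed field names); verifier(nl, program, result) -> P(correct)."""
        result_scores = defaultdict(float)
        for s in samples:
            p_gen = math.exp(s["logprob"])        # CodeLM generation probability
            p_ver = verifier(s["nl_input"], s["program"], s["exec_result"])
            # Marginalize: programs with the same execution result pool their scores.
            result_scores[s["exec_result"]] += p_gen * p_ver
        best_result = max(result_scores, key=result_scores.get)
        # Return any sampled program whose execution yields the winning result.
        return next(s["program"] for s in samples if s["exec_result"] == best_result)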
Training Trajectories of Language Models Across Scales
Scaling up language models has led to unprecedented performance gains, but
little is understood about how the training dynamics change as models get
larger. How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors? In this
paper, we analyze the intermediate training checkpoints of differently sized
OPT models (Zhang et al., 2022) -- from 125M to 175B parameters -- on next-token
prediction, sequence-level generation, and downstream tasks. We find that 1) at
a given perplexity, and independent of model size, a similar subset of training
tokens sees the most significant reduction in loss, with the rest stagnating or
showing double-descent behavior; 2) early in training, all models learn to
reduce the perplexity of grammatical sequences that contain hallucinations,
with small models halting at this suboptimal distribution and larger ones
eventually learning to assign these sequences lower probabilities; 3)
perplexity is a strong predictor of in-context learning performance on 74
multiple-choice tasks from BIG-Bench, and this holds independent of the model
size. Together, these results show that perplexity is more predictive of model
behaviors than model size or training computation.
Comment: Accepted to ACL 2023; The code and analysis results are available at
https://github.com/xiamengzhou/training_trajectory_analysi
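Finding (1) rests on measuring per-token next-token loss at intermediate checkpoints; the sketch below shows that measurement for a single checkpoint, assuming a Hugging Face-style causal LM (e.g., an OPT checkpoint) whose forward pass returns logits of shape (batch, seq, vocab). Checkpoint loading and the cross-checkpoint comparison loop are left as assumed context.

    import torch
    import torch.nn.functional as F

    def per_token_loss(model, input_ids):
        """Per-token next-token loss for one checkpoint (shape: batch x seq-1)."""
        with torch.no_grad():
            logits = model(input_ids).logits[:, :-1]   # predict token t+1 from token t
        targets = input_ids[:, 1:]
        losses = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        )
        return losses.view(targets.shape)

    # Comparing two checkpoints token by token (checkpoint loading assumed):
    # delta = per_token_loss(ckpt_early, ids) - per_token_loss(ckpt_late, ids)
    # Large positive entries mark the tokens whose loss drops most during training.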