20 research outputs found
Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models
Automated evaluation of open domain natural language generation (NLG) models
remains a challenge and widely used metrics such as BLEU and Perplexity can be
misleading in some cases. In our paper, we propose to evaluate natural language
generation models by learning to compare a pair of generated sentences by
fine-tuning BERT, which has been shown to have good natural language
understanding ability. We also propose to assess the model-level quality of
NLG models by aggregating sample-level comparison results with a skill rating
system. While our model can be trained in a fully self-supervised fashion, it
can be further fine-tuned with a small amount of human preference annotation to better
imitate human judgment. In addition to evaluating trained models, we propose to
apply our model as a performance indicator during training for better
hyperparameter tuning and early stopping. We evaluate our approach on both
story generation and chit-chat dialogue response generation. Experimental
results show that our model correlates better with human preference compared
with previous automated evaluation approaches. Training with the proposed
metric yields better performance in human evaluation, which further
demonstrates the effectiveness of the proposed model.
Comment: AAAI 2020
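To make the comparative-evaluation idea concrete, here is a minimal sketch (not the authors' released code; the model choice, label convention, and Elo-style update rule are illustrative assumptions) of comparing two generated sentences with a fine-tuned BERT classifier and folding sample-level wins into a model-level skill rating:

```python
# Minimal sketch: pairwise comparison with BERT plus an Elo-style
# skill-rating update. Names and hyperparameters are illustrative.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Assumed label convention: 0 -> first candidate preferred, 1 -> second.
comparator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def compare(candidate_a: str, candidate_b: str) -> int:
    """Return 0 if candidate_a is judged better, else 1."""
    inputs = tokenizer(candidate_a, candidate_b,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = comparator(**inputs).logits
    return int(logits.argmax(dim=-1))

def elo_update(rating_a: float, rating_b: float,
               a_won: bool, k: float = 16.0) -> tuple[float, float]:
    """One Elo-style update: aggregate sample-level comparison outcomes
    into model-level skill ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - expected_a)))
```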
Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference
Pre-trained Transformer models like T5 and BART have advanced the state of
the art on a wide range of text generation tasks. Compressing these models into
smaller ones has become critically important for practical use. Common neural
network compression techniques such as knowledge distillation or quantization
are limited to static compression where the compression ratio is fixed. In this
paper, we introduce Modular Transformers, a modularized encoder-decoder
framework for flexible sequence-to-sequence model compression. Modular
Transformers train modularized layers that serve the same function as two or
more consecutive layers in the original model, via module replacing and
knowledge distillation. After training, the modularized layers can be flexibly
assembled into sequence-to-sequence models that meet different
performance-efficiency trade-offs. Experimental results show that after a
single training phase, by simply varying the assembling strategy, Modular
Transformers can achieve flexible compression ratios from 1.1x to 6x with
little to moderate relative performance drops.
Comment: ACL 2023 Findings
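The module-replacing mechanism can be pictured with a hedged sketch (interfaces are assumed, not the paper's implementation): each compressed module is trained to stand in for a block of consecutive original layers, with a stochastic switch routing between the two paths during training:

```python
# Hedged sketch of module replacing: during training, randomly route
# through either the original block of layers or the single compressed
# module trained to imitate it. Interfaces are illustrative.
import random
import torch.nn as nn

class ReplaceableBlock(nn.Module):
    def __init__(self, original_layers: nn.ModuleList,
                 module_layer: nn.Module, p_replace: float = 0.5):
        super().__init__()
        self.original_layers = original_layers  # two or more consecutive layers
        self.module_layer = module_layer        # one layer mimicking the block
        self.p_replace = p_replace

    def forward(self, x):
        if self.training and random.random() < self.p_replace:
            return self.module_layer(x)     # train the compressed path
        for layer in self.original_layers:  # otherwise run the original block
            x = layer(x)
        return x

# After training, a compressed model is assembled by picking, per block,
# either the module or the original layers to hit a target ratio.
```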
Commonsense Knowledge Transfer for Pre-trained Language Models
Despite serving as the foundation models for a wide range of NLP benchmarks,
pre-trained language models have shown a limited ability to acquire
implicit commonsense knowledge from self-supervision alone, compared to
linguistic and factual knowledge, which appears more explicitly in the
surface patterns of text. In this work, we introduce commonsense knowledge
transfer, a framework to transfer the commonsense knowledge stored in a neural
commonsense knowledge model to a general-purpose pre-trained language model. It
first exploits general texts to form queries for extracting commonsense
knowledge from the neural commonsense knowledge model and then refines the
language model with two self-supervised objectives: commonsense mask infilling
and commonsense relation prediction, which align human language with the
underlying commonsense knowledge. Empirical results show that our approach
consistently improves the model's performance on downstream tasks that require
commonsense reasoning. Moreover, we find that the improvement is more
significant in the few-shot setting. This suggests that our approach helps
language models better transfer to downstream tasks without extensive
supervision by injecting commonsense knowledge into their parameters.
Comment: ACL 2023 Findings
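As a rough illustration of the two objectives (the triple format, relation set, and helper names below are assumptions, not the paper's code), each commonsense triple extracted from the neural knowledge model can be turned into one mask-infilling example and one relation-prediction example:

```python
# Hedged sketch: building training examples for commonsense mask infilling
# and commonsense relation prediction from extracted knowledge triples.
from dataclasses import dataclass

@dataclass
class KnowledgeTriple:
    head: str      # e.g. "PersonX goes to the dentist"
    relation: str  # e.g. "xIntent"
    tail: str      # e.g. "to fix a toothache"

RELATIONS = ["xIntent", "xNeed", "xEffect"]  # illustrative label set

def mask_infilling_example(t: KnowledgeTriple, mask: str = "<mask>"):
    """Commonsense mask infilling: hide the tail and let the LM fill it."""
    source = f"{t.head} {t.relation} {mask}"
    return source, t.tail

def relation_prediction_example(t: KnowledgeTriple):
    """Commonsense relation prediction: classify the relation that links
    the head and tail texts."""
    return (t.head, t.tail), RELATIONS.index(t.relation)
```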
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Recent research has highlighted the importance of dataset size in scaling
language models. However, large language models (LLMs) are notoriously
token-hungry during pre-training, and high-quality text data on the web is
approaching its scaling limit for LLMs. To further enhance LLMs, a
straightforward approach is to repeat the pre-training data for additional
epochs. In this study, we empirically investigate three key aspects under this
approach. First, we explore the consequences of repeating pre-training data,
revealing that the model is susceptible to overfitting, leading to multi-epoch
degradation. Second, we examine the key factors contributing to multi-epoch
degradation, finding that significant factors include dataset size, model
parameters, and training objectives, while less influential factors consist of
dataset quality and model FLOPs. Finally, we explore whether widely used
regularization can alleviate multi-epoch degradation. Most regularization
techniques do not yield significant improvements, except for dropout, which
demonstrates remarkable effectiveness but requires careful tuning when scaling
up the model size. Additionally, we discover that leveraging mixture-of-experts
(MoE) enables cost-effective and efficient hyper-parameter tuning for
computationally intensive dense LLMs with comparable trainable parameters,
potentially impacting efficient LLM development on a broader scale.
Comment: Accepted at NeurIPS 2023
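The setup under study can be pictured with a toy loop (a sketch under assumed model and dataset interfaces; all values are illustrative, not the paper's): the same fixed corpus is repeated for several epochs, with dropout, the one broadly effective regularizer, enabled throughout:

```python
# Toy sketch of multi-epoch pre-training on a fixed corpus, with dropout
# as the regularizer. HuggingFace-style model/batch interfaces assumed.
import torch
from torch.utils.data import DataLoader

def pretrain(model, dataset, epochs: int = 4, dropout_p: float = 0.1):
    # Set dropout everywhere; the paper reports it needs careful tuning
    # as model size grows.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = dropout_p
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    model.train()
    for _ in range(epochs):        # repeating the same data for more epochs
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```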
X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Vision language pre-training aims to learn alignments between vision and
language from a large amount of data. We previously proposed multi-grained
vision language pre-training, a unified approach that learns vision language
alignments at multiple granularities. This paper advances that method by unifying image
and video encoding in one model and scaling up the model with large-scale data.
We present X2-VLM, a pre-trained VLM with a modular architecture for both
image-text and video-text tasks. Experimental results show that X2-VLM
performs best at both base and large scale on image-text and video-text
tasks, striking a good trade-off between performance and model scale. Moreover,
we show that the modular design of X2-VLM makes it highly transferable,
allowing it to be utilized in any language or domain. For example, by simply
replacing the text encoder with XLM-R, X2-VLM outperforms state-of-the-art
multilingual multi-modal pre-trained models without any multilingual
pre-training. The code and pre-trained models will be available at
github.com/zengyan-97/X2-VLM.
Comment: 21 pages, 8 figures
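The multilingual swap can be illustrated with a minimal sketch (component interfaces here are assumptions; the repository linked above holds the actual code): because the text encoder is a self-contained module, it can be replaced with XLM-R while the rest of the model stays unchanged:

```python
# Hedged sketch of the modular text-encoder swap. Interfaces are
# illustrative; the released code is at github.com/zengyan-97/X2-VLM.
import torch.nn as nn
from transformers import XLMRobertaModel

class ModularVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module,
                 text_encoder: nn.Module, fusion: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder  # swappable module
        self.fusion = fusion

    def forward(self, pixel_values, input_ids, attention_mask):
        v = self.vision_encoder(pixel_values)
        t = self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.fusion(v, t)

# Swapping in a multilingual text encoder; everything else is reused:
# model.text_encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
```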