Can Language Models perform Abductive Commonsense Reasoning?
Abductive reasoning is the task of inferring the most plausible hypothesis
given a set of observations. In the literature, the community has approached
this challenge by classifying or generating a likely hypothesis that
contradicts neither the past observation nor the future observation. The
best-known benchmarks for this problem are aNLI and aNLG (pronounced
alpha-NLI and alpha-NLG). In this report, I review some of the methodologies
that have been attempted for this challenge, re-implement the baseline models,
and analyze some of the weaknesses of current approaches. The code and the
re-implemented results are available at this link.
Comment: 6 pages
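To make the classification variant concrete, below is a minimal sketch, not the report's re-implemented baselines (which fine-tune pretrained classifiers on aNLI): given a past observation O1 and a future observation O2, it picks the candidate hypothesis H that makes the narrative O1 -> H -> O2 most plausible, approximating plausibility with an off-the-shelf GPT-2's average token log-likelihood. The observation and hypothesis strings are invented for illustration.

```python
# Sketch of aNLI-style hypothesis selection via LM likelihood.
# Assumption: plausibility of "O1 H O2" is approximated by the average
# token log-likelihood under GPT-2; real baselines fine-tune classifiers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

def choose_hypothesis(o1: str, o2: str, hypotheses: list[str]) -> str:
    """Return the hypothesis whose narrative O1 -> H -> O2 scores highest."""
    return max(hypotheses, key=lambda h: avg_log_likelihood(f"{o1} {h} {o2}"))

# Toy instance (invented, not drawn from the aNLI dataset).
print(choose_hypothesis(
    o1="Dotty was being very grumpy.",
    o2="She felt much better afterwards.",
    hypotheses=["Dotty ate something that upset her stomach.",
                "Dotty talked things over with a close friend."],
))
```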
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Evaluation of Large Language Models (LLMs) is challenging because aligning to
human values requires the composition of multiple skills and the required set
of skills varies depending on the instruction. Recent studies have evaluated
the performance of LLMs in two ways: (1) automatic evaluation on several
independent benchmarks, and (2) human- or machine-based evaluation that assigns
an overall score to the response. However, both settings are coarse-grained
evaluations that ignore the nature of user instructions requiring
instance-wise skill composition, which limits the interpretation of the true
capabilities of LLMs. In this paper, we introduce FLASK (Fine-grained Language
Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation
protocol, usable for both model-based and human-based evaluation, that
decomposes coarse-level scoring into instance-wise, skill set-level scoring.
Specifically, we define 12 fine-grained skills needed for LLMs to follow
open-ended user instructions and construct an evaluation set by allocating a
set of skills for each instance. Additionally, by annotating the target domains
and difficulty level for each instance, FLASK provides a holistic view with a
comprehensive analysis of a model's performance depending on skill, domain, and
difficulty. Using FLASK, we compare multiple open-source and
proprietary LLMs and observe highly correlated findings between model-based and
human-based evaluations. FLASK enables developers to measure model performance
more accurately and to understand how it can be improved by analyzing the
factors that make LLMs proficient in particular skills. For practitioners,
FLASK can be used to
recommend suitable models for particular situations through comprehensive
comparison among various LLMs. We release the evaluation data and code
implementation at https://github.com/kaistAI/FLASK
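To illustrate what instance-wise, skill set-level scoring buys over a single overall score, here is a minimal sketch with hypothetical field names and toy scores (not FLASK's released schema or data): each instance carries its allocated skill set plus domain and difficulty annotations, and per-skill scores can then be aggregated along any of the three axes.

```python
# Sketch of skill/domain/difficulty aggregation over per-instance skill
# scores. All records below are toy values, not FLASK annotations.
from collections import defaultdict
from statistics import mean

instances = [
    {"id": 0, "skills": ["logical reasoning", "factuality"],
     "domain": "science", "difficulty": 3},
    {"id": 1, "skills": ["factuality", "conciseness"],
     "domain": "history", "difficulty": 2},
]

# Per-instance, per-skill scores on a 1-5 scale (e.g. from a judge model).
scores = {0: {"logical reasoning": 4, "factuality": 3},
          1: {"factuality": 5, "conciseness": 4}}

def aggregate(by: str) -> dict:
    """Average skill scores grouped by 'skills', 'domain', or 'difficulty'."""
    buckets = defaultdict(list)
    for inst in instances:
        inst_scores = scores[inst["id"]]
        if by == "skills":
            for skill, score in inst_scores.items():
                buckets[skill].append(score)
        else:
            buckets[inst[by]].extend(inst_scores.values())
    return {k: mean(v) for k, v in buckets.items()}

print(aggregate("skills"))      # e.g. {'factuality': 4, ...}
print(aggregate("domain"))      # per-domain averages
print(aggregate("difficulty"))  # per-difficulty averages
```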
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
Language models (LMs) with fewer than 100B parameters are known to perform
poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when
solving unseen tasks. In this work, we aim to equip smaller LMs with
step-by-step reasoning capability by instruction tuning with CoT rationales. To
achieve this goal, we first introduce a new instruction-tuning dataset
called the CoT Collection, which augments the existing Flan Collection
(which includes only 9 CoT tasks) with an additional 1.84 million rationales
across 1,060 tasks. We show that CoT fine-tuning of Flan-T5 (3B & 11B) with the
CoT Collection gives smaller LMs better CoT capabilities on unseen tasks.
On the BIG-Bench-Hard (BBH) benchmark, we report an average improvement of
+4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in zero-shot task
accuracy. Furthermore, we show that instruction tuning with the CoT Collection
gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks,
an improvement of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even
outperforming ChatGPT, which uses demonstrations up to the maximum input
length, by a +13.98% margin. Our code, the CoT Collection data, and model
checkpoints are publicly available.
Comment: EMNLP 2023 (Main Conference)
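To show the kind of training pair this involves, here is a minimal sketch of a CoT instruction-tuning example (illustrative strings, not actual CoT Collection records): the source combines the task instruction with the question, and the target emits a step-by-step rationale before the final answer, which is what the seq2seq model is fine-tuned to generate.

```python
# Sketch of a CoT fine-tuning pair; the exact prompt template used for
# the CoT Collection may differ from this assumed format.
def to_cot_example(instruction: str, question: str,
                   rationale: str, answer: str) -> dict:
    source = f"{instruction}\n\n{question}\n\nLet's think step by step."
    target = f"{rationale} So the answer is {answer}."
    return {"source": source, "target": target}

example = to_cot_example(
    instruction="Answer the following math word problem.",
    question="Ann has 3 apples and buys 2 more. How many does she have?",
    rationale="Ann starts with 3 apples and gains 2 more, and 3 + 2 = 5.",
    answer="5",
)
print(example["source"])
print(example["target"])
```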
Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging
While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language
Models (LLMs) with general, aggregate human preferences, it is suboptimal for
learning diverse, individual perspectives. In this work, we study the
Reinforcement Learning from Personalized Human Feedback (RLPHF) problem,
wherein LLMs are
aligned to multiple (sometimes conflicting) preferences by modeling alignment
as a Multi-Objective Reinforcement Learning (MORL) problem. Compared to strong
single-objective baselines, we show that we can achieve personalized alignment
by decomposing preferences into multiple dimensions. These dimensions are
defined based on personalizations that are declared as desirable by the user.
We show that these dimensions can be trained efficiently and independently in
a distributed manner and then combined effectively post-hoc through parameter
merging.
The code is available at https://github.com/joeljang/RLPHF.
Comment: Preprint
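As a sketch of the post-hoc merging step, suppose policies have been fine-tuned independently, one per preference dimension; a user's declared preference weights then select a weighted average of their parameters. This is a minimal illustration of parameter merging in general, with toy tensors and hypothetical names; the paper's exact procedure and weighting scheme may differ.

```python
# Sketch of post-hoc parameter merging: a preference-weighted average of
# independently trained models' parameters.
import torch

def merge_state_dicts(state_dicts: list[dict], weights: list[float]) -> dict:
    """Weighted average of matching parameters across models."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    return {
        name: sum(w * sd[name] for sd, w in zip(state_dicts, weights))
        for name in state_dicts[0]
    }

# Two toy "experts", each trained for one preference dimension.
expert_concise = {"lm_head.weight": torch.full((2, 2), 1.0)}
expert_friendly = {"lm_head.weight": torch.full((2, 2), 3.0)}

# A user who weights conciseness 75% and friendliness 25%.
merged = merge_state_dicts([expert_concise, expert_friendly], [0.75, 0.25])
print(merged["lm_head.weight"])  # all entries 0.75 * 1.0 + 0.25 * 3.0 = 1.5
```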