Active Prompting with Chain-of-Thought for Large Language Models
The increasing scale of large language models (LLMs) brings emergent
abilities to various complex tasks requiring reasoning, such as arithmetic and
commonsense reasoning. It is known that the effective design of task-specific
prompts is critical for LLMs' ability to produce high-quality answers. In
particular, an effective approach for complex question-and-answer tasks is
example-based prompting with chain-of-thought (CoT) reasoning, which
significantly improves the performance of LLMs. However, current CoT methods
rely on a fixed set of human-annotated exemplars, which are not necessarily the
most effective examples for different tasks. This paper proposes a new method,
Active-Prompt, to adapt LLMs to different tasks with task-specific example
prompts (annotated with human-designed CoT reasoning). For this purpose, we
propose a solution to the key problem of determining which questions are the
most important and helpful ones to annotate from a pool of task-specific
queries. By borrowing ideas from the related problem of uncertainty-based
active learning, we introduce several metrics to characterize the uncertainty
so as to select the most uncertain questions for annotation. Experimental
results demonstrate the superiority of our proposed method, achieving
state-of-the-art performance on eight complex reasoning tasks. Further analyses of
different uncertainty metrics, pool sizes, zero-shot learning, and the
accuracy-uncertainty relationship demonstrate the effectiveness of our method.
Our code will be available at https://github.com/shizhediao/active-prompt.
Comment: 20 pages, 3 figures, 11 tables
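The selection step described above can be illustrated with a short sketch. This is not the released Active-Prompt code: `sample_answers` is a hypothetical helper that queries an LLM k times with chain-of-thought prompting and returns the k extracted final answers, and the two uncertainty metrics (disagreement and entropy) are chosen as representative examples.

```python
# Illustrative sketch of uncertainty-based question selection (not the
# authors' implementation). `sample_answers(question, k)` is a hypothetical
# helper that samples k chain-of-thought completions from an LLM and returns
# the k extracted final answers.
from collections import Counter
import math

def disagreement(answers):
    # Fraction of distinct final answers among the k samples; higher = more uncertain.
    return len(set(answers)) / len(answers)

def entropy(answers):
    # Shannon entropy of the empirical answer distribution.
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in Counter(answers).values())

def select_for_annotation(questions, sample_answers, k=10, n=8, metric=disagreement):
    # Score every question in the pool, then return the n most uncertain ones,
    # which would be handed to human annotators for CoT rationales.
    scored = [(metric(sample_answers(q, k)), q) for q in questions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:n]]
```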
Normalizing Flow with Variational Latent Representation
Normalizing flow (NF) has gained popularity over traditional maximum
likelihood based methods due to its strong capability to model complex data
distributions. However, the standard approach, which maps the observed data to
a normal distribution, has difficulty in handling data distributions with
multiple relatively isolated modes. To overcome this issue, we propose a new
framework based on variational latent representation to improve the practical
performance of NF. The idea is to replace the standard normal latent variable
with a more general latent representation, jointly learned via Variational
Bayes. For example, by taking the latent representation as a discrete sequence,
our framework can learn a Transformer model that generates the latent sequence
and an NF model that generates the continuous data distribution conditioned on
the sequence. The resulting method is significantly more powerful than the
standard normalizing flow approach for generating data distributions with
multiple modes. Extensive experiments have shown the advantages of NF with
variational latent representation.
Comment: 24 pages, 7 figures
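As a rough illustration of the framework, the PyTorch sketch below pairs a variational encoder with a conditional affine-coupling flow and trains by maximizing an ELBO. It simplifies the paper's example in one major way: a continuous Gaussian latent stands in for the discrete, Transformer-generated latent sequence, and all module names and sizes are illustrative.

```python
# Sketch only: an affine-coupling flow conditioned on a variationally learned
# latent, trained with an ELBO. A continuous Gaussian latent is used here for
# brevity in place of the discrete Transformer-generated sequence.
import math
import torch
import torch.nn as nn

class CondAffineCoupling(nn.Module):
    def __init__(self, dim, z_dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x, z):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, z], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)                              # bounded log-scales for stability
        y2 = x2 * torch.exp(s) + t
        return torch.cat([x1, y2], dim=-1), s.sum(-1)  # transformed x, log|det J|

class VariationalLatentNF(nn.Module):
    def __init__(self, dim=2, z_dim=8, n_flows=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 2 * z_dim))
        self.flows = nn.ModuleList([CondAffineCoupling(dim, z_dim) for _ in range(n_flows)])

    def elbo(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        y, log_det = x, 0.0
        for flow in self.flows:
            y, ld = flow(y, z)
            y = torch.flip(y, dims=[-1])               # cheap permutation between couplings
            log_det = log_det + ld
        log_base = -0.5 * (y ** 2 + math.log(2 * math.pi)).sum(-1)   # N(0, I) base density
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(-1)     # KL(q(z|x) || N(0, I))
        return (log_base + log_det - kl).mean()        # maximize during training
```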
On the Difference of BERT-style and CLIP-style Text Encoders
Masked language modeling (MLM) has been one of the most popular pretraining
recipes in natural language processing, e.g., BERT, one of the representative
models. Recently, contrastive language-image pretraining (CLIP) has also
attracted attention, especially its vision models that achieve excellent
performance on a broad range of vision tasks. However, few studies have
examined the text encoders learned by CLIP. In this paper, we
analyze the difference between BERT-style and CLIP-style text encoders from
three experiments: (i) general text understanding, (ii) vision-centric text
understanding, and (iii) text-to-image generation. Experimental analyses show
that although CLIP-style text encoders underperform BERT-style ones on general
text understanding tasks, they are equipped with a unique ability for
cross-modal association, i.e., synesthesia, which more closely resembles human
senses.
Comment: Natural Language Processing. 10 pages, 1 figure. Findings of ACL-202
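For readers who want to run a comparison of this kind, a minimal sketch using the Hugging Face Transformers library follows. The checkpoints (`bert-base-uncased`, `openai/clip-vit-base-patch32`) and the mean-pooling choice for BERT are illustrative assumptions, not necessarily the paper's setup.

```python
# Sketch: obtain sentence embeddings from a BERT-style and a CLIP-style text
# encoder for side-by-side analysis. Checkpoints and pooling are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPTokenizer

texts = ["a photo of a red apple on a wooden table"]

# BERT-style encoder: mean-pool the final hidden states into one vector per text.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    enc = bert_tok(texts, return_tensors="pt", padding=True)
    bert_emb = bert(**enc).last_hidden_state.mean(dim=1)

# CLIP-style text encoder: use the projected text features that pair with images.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    enc = clip_tok(texts, return_tensors="pt", padding=True)
    clip_emb = clip.get_text_features(**enc)

print(bert_emb.shape, clip_emb.shape)  # 768-dim vs. 512-dim text representations
```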
UniTime: A Language-Empowered Unified Model for Cross-Domain Time Series Forecasting
Multivariate time series forecasting plays a pivotal role in contemporary web
technologies. In contrast to conventional methods that involve creating
dedicated models for specific time series application domains, this research
advocates for a unified model paradigm that transcends domain boundaries.
However, learning an effective cross-domain model presents the following
challenges. First, various domains exhibit disparities in data characteristics,
e.g., the number of variables, posing hurdles for existing models that impose
inflexible constraints on these factors. Second, the model may encounter
difficulties in distinguishing data from various domains, leading to suboptimal
performance in our assessments. Third, the diverse convergence rates of time
series domains can also result in compromised empirical performance. To address
these issues, we propose UniTime for effective cross-domain time series
learning. Concretely, UniTime can flexibly adapt to data with varying
characteristics. It also uses domain instructions and a Language-TS Transformer
to provide domain identification information and align the two modalities. In addition,
UniTime employs masking to alleviate domain convergence speed imbalance issues.
Our extensive experiments demonstrate the effectiveness of UniTime in advancing
state-of-the-art forecasting performance and zero-shot transferability.
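To make the recipe concrete, here is a highly simplified sketch of a unified forecaster that feeds embedded text "domain instructions" together with patch embeddings of a single series into a shared Transformer. It is not the UniTime architecture: the Language-TS alignment and masking components are omitted, and every name, dimension, and the channel-independent treatment of variables are illustrative assumptions.

```python
# Highly simplified sketch (PyTorch) of a cross-domain forecaster in the spirit
# of the description above; not the UniTime implementation.
import torch
import torch.nn as nn

class UnifiedForecaster(nn.Module):
    def __init__(self, patch_len=16, d_model=128, horizon=96, vocab_size=30522):
        super().__init__()
        self.patch_len = patch_len
        self.text_emb = nn.Embedding(vocab_size, d_model)   # embeds instruction token ids
        self.patch_proj = nn.Linear(patch_len, d_model)     # embeds time-series patches
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, series, instruction_ids):
        # series: (batch, length), one variable at a time, so the model does not
        # constrain how many variables a given domain has (channel independence).
        patches = series.unfold(1, self.patch_len, self.patch_len)  # (B, n_patches, patch_len)
        tokens = torch.cat([self.text_emb(instruction_ids),         # (B, n_text, d_model)
                            self.patch_proj(patches)], dim=1)
        hidden = self.encoder(tokens)
        return self.head(hidden[:, -1])                             # (B, horizon) point forecast

# Example shapes: a batch of 4 series of length 96 with an 8-token instruction.
# model = UnifiedForecaster()
# out = model(torch.randn(4, 96), torch.randint(0, 30522, (4, 8)))  # -> (4, 96)
```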
Plum: Prompt Learning using Metaheuristic
Since the emergence of large language models, prompt learning has become a
popular method for optimizing and customizing these models. Special prompts,
such as Chain-of-Thought, have even revealed previously unknown reasoning
capabilities within these models. However, the progress of discovering
effective prompts has been slow, driving a desire for general prompt
optimization methods. Unfortunately, few existing prompt learning methods
satisfy the criteria of being truly "general", i.e., automatic, discrete,
black-box, gradient-free, and interpretable all at once. In this paper, we
introduce metaheuristics, a branch of discrete non-convex optimization methods
with over 100 options, as a promising approach to prompt learning. Within our
paradigm, we test six typical methods: hill climbing, simulated annealing,
genetic algorithms with/without crossover, tabu search, and harmony search,
demonstrating their effectiveness in black-box prompt learning and
Chain-of-Thought prompt tuning. Furthermore, we show that these methods can be
used to discover more human-understandable prompts that were previously
unknown, opening the door to a cornucopia of possibilities in prompt
optimization. We release all the code at https://github.com/research4pan/Plum.
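As an illustration of what the simplest of these metaheuristics looks like for black-box prompt search, here is a minimal hill-climbing sketch. The `score` callback (e.g., dev-set accuracy when the candidate prompt is prepended to each query), the word pool, and the edit operations are all hypothetical stand-ins.

```python
# Minimal hill-climbing sketch for gradient-free, black-box prompt search;
# illustrative only. `score(prompt)` is a hypothetical black-box evaluator.
import random

CANDIDATE_WORDS = ["carefully", "step", "by", "think", "reason", "precisely", "answer"]

def neighbors(prompt, n=8):
    # Propose n neighboring prompts via a single word replacement or insertion.
    words = prompt.split()
    out = []
    for _ in range(n):
        w = list(words)
        pos = random.randrange(len(w) + 1)
        if pos < len(w) and random.random() < 0.5:
            w[pos] = random.choice(CANDIDATE_WORDS)        # replace one word
        else:
            w.insert(pos, random.choice(CANDIDATE_WORDS))  # insert one word
        out.append(" ".join(w))
    return out

def hill_climb(initial_prompt, score, iterations=50):
    best, best_score = initial_prompt, score(initial_prompt)
    for _ in range(iterations):
        for cand in neighbors(best):
            s = score(cand)
            if s > best_score:                  # greedily move to a better neighbor
                best, best_score = cand, s
    return best, best_score
```

Simulated annealing, tabu search, and the other methods tested in the paper differ mainly in how they replace this greedy acceptance rule with their own acceptance and memory strategies.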
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Generative foundation models are susceptible to implicit biases that can
arise from extensive unsupervised training data. Such biases can produce
suboptimal samples, skewed outcomes, and unfairness, with potentially
significant repercussions. Consequently, aligning these models with human
ethics and preferences is an essential step toward ensuring their responsible
and effective deployment in real-world applications. Prior research has
primarily employed Reinforcement Learning from Human Feedback (RLHF) as a means
of addressing this problem, wherein generative models are fine-tuned using RL
algorithms guided by a human-feedback-informed reward model. However, the
inefficiencies and instabilities associated with RL algorithms frequently
present substantial obstacles to the successful alignment of generative models,
necessitating the development of a more robust and streamlined approach. To
this end, we introduce a new framework, Reward rAnked FineTuning (RAFT),
designed to align generative models more effectively. Utilizing a reward model
and a sufficient number of samples, our approach selects the high-quality
samples, discarding those that exhibit undesired behavior, and subsequently
assembles a streaming dataset. This dataset serves as the basis for aligning
the generative model and can be employed under both offline and online
settings. Notably, the sample generation process within RAFT is gradient-free,
rendering it compatible with black-box generators. Through extensive
experiments, we demonstrate that our proposed algorithm exhibits strong
performance in the context of both large language models and diffusion models.
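The loop described above (sample, rank by reward, keep the best, fine-tune) can be summarized in a few lines. This is a sketch rather than the released implementation: `generate`, `reward`, and `finetune` are hypothetical hooks for the (possibly black-box) generator, the reward model, and a supervised fine-tuning step, and the keep ratio is an arbitrary example.

```python
# Sketch of reward-ranked fine-tuning rounds in the spirit of RAFT; the
# three callbacks are hypothetical hooks, not a real API.
def raft_round(prompts, generate, reward, finetune, k=8, keep_ratio=0.125):
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]           # k samples per prompt
        ranked = sorted(candidates, key=lambda r: reward(prompt, r), reverse=True)
        n_keep = max(1, int(k * keep_ratio))                        # keep the top-reward samples
        selected.extend((prompt, resp) for resp in ranked[:n_keep])
    finetune(selected)   # supervised fine-tuning on the high-reward subset
    return selected

def raft(prompts, generate, reward, finetune, rounds=3):
    # Repeating the round lets the updated generator produce the next batch of
    # samples, giving the streaming / iterative behavior noted in the abstract.
    for _ in range(rounds):
        raft_round(prompts, generate, reward, finetune)
```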
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
Fine-grained control over large language models (LLMs) remains a significant
challenge, hindering their adaptability to diverse user needs. While
Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning
LLMs, its reliance on scalar rewards often limits its ability to capture
diverse user preferences in real-world applications. To address this
limitation, we introduce the Directional Preference Alignment (DPA) framework.
Unlike scalar-reward RLHF, DPA incorporates multi-objective reward modeling
to represent diverse preference profiles. Additionally, DPA models user
preferences as directions (i.e., unit vectors) in the reward space to achieve
user-dependent preference control. Our method involves training a
multi-objective reward model and then fine-tuning the LLM with a
preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF
method adopted by Llama 2. This method enjoys a better performance trade-off
across various reward objectives. In comparison with the scalar-reward RLHF,
DPA offers users intuitive control over LLM generation: they can arithmetically
specify their desired trade-offs (e.g., more helpfulness with less verbosity).
We also validate the effectiveness of DPA with real-world alignment experiments
on Mistral-7B. Our method provides straightforward arithmetic control over the
trade-off between helpfulness and verbosity while maintaining competitive
performance with strong baselines such as Direct Preference Optimization (DPO).
Comment: The code and model are released at
https://github.com/Haoxiang-Wang/directional-preference-alignmen
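The direction-conditioned selection step lends itself to a compact sketch: a user preference is a unit vector in the multi-objective reward space, and rejection sampling keeps the response whose reward vector best matches that direction. `generate` and `multi_reward` are hypothetical hooks, and the two objectives follow the helpfulness/verbosity example in the abstract.

```python
# Sketch of direction-conditioned rejection sampling over a multi-objective
# reward model; illustrative only. `multi_reward(prompt, response)` is assumed
# to return a vector of objective scores, e.g. [helpfulness, brevity].
import numpy as np

def best_response_for_direction(prompt, direction, generate, multi_reward, k=8):
    v = np.asarray(direction, dtype=float)
    v = v / np.linalg.norm(v)                        # user preference as a unit vector
    candidates = [generate(prompt) for _ in range(k)]
    rewards = np.stack([np.asarray(multi_reward(prompt, r), dtype=float)
                        for r in candidates])        # (k, n_objectives)
    scores = rewards @ v                             # scalarize along the chosen direction
    return candidates[int(np.argmax(scores))]

# Example: a user who weights helpfulness three times as heavily as brevity.
# chosen = best_response_for_direction(prompt, direction=[3.0, 1.0],
#                                      generate=generate, multi_reward=multi_reward)
```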
Mitigating the Alignment Tax of RLHF
LLMs acquire a wide range of abilities during pre-training, but aligning LLMs
with Reinforcement Learning from Human Feedback (RLHF) can lead to forgetting,
which is also known as the alignment tax. To empirically verify this
hypothesis, we conducted experiments with existing RLHF algorithms using
OpenLLaMA-3B, which revealed a pronounced alignment tax in NLP tasks. On the
other hand, although various techniques exist to mitigate forgetting, they are
often at odds with RLHF performance, leading to a trade-off between reward
maximization and forgetting mitigation.
In light of the above pressing issue in aligning LLMs, in this paper we
explore model averaging, which interpolates between pre- and post-RLHF model
weights, to achieve a more efficient reward-tax Pareto front. To understand its
effectiveness, we offer theoretical insights into model averaging, revealing
that it enhances the performance Pareto front by increasing feature diversity on
the layers where tasks share overlapping feature spaces. Empirical evidence
corroborates our analysis by showing the benefits of averaging low-level
transformer layers. Building on the analysis and the observation that averaging
different layers of the transformer leads to significantly different reward-tax
trade-offs, we propose Adaptive Model Averaging (AMA) to adaptively find
various combination ratios of model layers. AMA seeks to maximize the alignment
reward while incurring minimal alignment tax. Moreover, we validate AMA's
performance across a range of RLHF algorithms over OpenLLaMA-3B and further
extend our findings to Mistral-7B.
Comment: 28 pages
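The model-averaging step the paper builds on is easy to state in code. The sketch below interpolates pre- and post-RLHF checkpoints with a per-parameter ratio, which is the kind of layer-wise knob an adaptive scheme such as AMA would search over; the key-matching rule and the example ratios are purely illustrative.

```python
# Sketch of layer-wise model averaging between a pre-RLHF and a post-RLHF
# checkpoint (state dicts of tensors with matching keys); illustrative only.
def average_state_dicts(pre_sd, rlhf_sd, alpha_for_key):
    # alpha_for_key(name) -> weight in [0, 1] put on the pre-RLHF parameters.
    merged = {}
    for name, w_pre in pre_sd.items():
        alpha = alpha_for_key(name)
        merged[name] = alpha * w_pre + (1.0 - alpha) * rlhf_sd[name]
    return merged

# Example policy: genuinely average the lower transformer layers (where the
# analysis above reports the benefit) and keep the RLHF weights elsewhere.
# The layer-name pattern and the 0.5 ratio are arbitrary illustrations.
def example_alpha(name):
    is_low_layer = any(f"layers.{i}." in name for i in range(8))
    return 0.5 if is_low_layer else 0.0
```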