VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following
We introduce VISUAL EMBEDDED INSTRUCTION (VIM), a new framework designed to
evaluate the visual instruction following capability of Multimodal Large
Language Models (MLLMs). As illustrated in Figure 2, VIM challenges the MLLMs
by embedding the instructions into the visual scenes, demanding strong visual
interpretative skills for instruction following. We adapt VIM to various
benchmarks, including VQAv2, MME, MM-Vet, and RefCOCO series, compose a VIM
bench, and probe diverse MLLMs across three distinct in-context learning
settings: Zero Shot, One Shot, and Pair Shot. We observe that there is a
significant performance disparity between the open-source MLLMs and GPT-4V,
implying that their proficiency in visual instruction comprehension is not up
to par. Our results highlight a promising direction for the enhancement of
MLLMs' capabilities on instruction following. We intend VIM to serve as a useful
norm for advancing the state of the art and driving further progress in the
field.
Comment: 20 pages, 8 figures, 20 tables
STEER: Unified Style Transfer with Expert Reinforcement
While text style transfer has many applications across natural language
processing, the core premise of transferring from a single source style is
unrealistic in a real-world setting. In this work, we focus on arbitrary style
transfer: rewriting a text from an arbitrary, unknown style to a target style.
We propose STEER: Unified Style Transfer with Expert Reinforcement, a unified
framework developed to overcome the challenge of limited parallel data for
style transfer. STEER involves automatically generating a corpus of
style-transfer pairs using a product of experts during decoding. The generated
offline data is then used to pre-train an initial policy before switching to
online, off-policy reinforcement learning for further improvements via
fine-grained reward signals. STEER is unified and can transfer to multiple
target styles from an arbitrary, unknown source style, making it particularly
flexible and efficient.
Experimental results on a challenging dataset with text from a diverse set of
styles demonstrate state-of-the-art results compared to competitive baselines.
Remarkably, STEER outperforms the 175B parameter instruction-tuned GPT-3 on
overall style transfer quality, despite being 226 times smaller in size. We
also show STEER is robust, maintaining its style transfer capabilities on
out-of-domain data, and surpassing nearly all baselines across various styles.
The success of our method highlights the potential of RL algorithms when
augmented with controllable decoding to overcome the challenge of limited data
supervision.
Comment: for associated code, see
https://github.com/shallinan1/STEERStyleTransfe
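The STEER abstract mentions generating style-transfer pairs "using a product of experts during decoding". As a rough illustration of that general idea only, the sketch below multiplies several next-token distributions and renormalizes. The function name, the weights, and the toy three-token vocabulary are assumptions made for this example; they are not taken from the paper or its code.

```python
import math

def product_of_experts(distributions, weights=None):
    """Combine next-token distributions so that p(x) is proportional to
    the product of p_i(x) ** w_i over all experts i.

    `distributions` is a list of probability lists over the same vocabulary.
    `weights` is a hypothetical per-expert exponent (defaults to 1.0 each).
    """
    if weights is None:
        weights = [1.0] * len(distributions)
    vocab_size = len(distributions[0])

    # Work in log space so that small probabilities do not underflow.
    log_scores = []
    for token in range(vocab_size):
        score = sum(
            w * math.log(max(dist[token], 1e-12))
            for dist, w in zip(distributions, weights)
        )
        log_scores.append(score)

    # Renormalize with the usual max-subtraction trick.
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Toy example: one expert favors fluent text, the other the target style.
fluency_expert = [0.5, 0.3, 0.2]
style_expert = [0.1, 0.6, 0.3]
combined = product_of_experts([fluency_expert, style_expert])
```

A token scores highly in the combined distribution only when every expert assigns it reasonable probability, which is why such a product can be used to steer generation toward text that is both fluent and in the target style.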
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models
The common practice for training commonsense models has gone
from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in
order to train commonsense models. In this work, we investigate an alternative,
from-machine-to-corpus-to-machine: general language models author these
commonsense knowledge graphs to train commonsense models. Our study leads to a
new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge
Distillation (Hinton et al., 2015), our approach uses larger models to teach
smaller models. A key difference is that we distill knowledge symbolically, as
text, in addition to the neural model. We also distill only one aspect, the
commonsense of a general language model teacher, allowing the student to be a
different type, a commonsense model. Altogether, we show that careful prompt
engineering and a separately trained critic model allow us to selectively
distill high-quality causal commonsense from GPT-3, a general language model.
Empirical results demonstrate that, for the first time, a human-authored
commonsense knowledge graph is surpassed by our automatically distilled variant
in all three criteria: quantity, quality, and diversity. In addition, it
results in a neural commonsense model that surpasses the teacher model's
commonsense capabilities despite its 100x smaller size. We apply this to the
ATOMIC resource, and share our new symbolic knowledge graph and commonsense
models.
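The Symbolic Knowledge Distillation abstract describes a separately trained critic that selects high-quality generations from the teacher. A minimal sketch of that filtering step, assuming the critic returns a score in [0, 1] and that a fixed threshold is used, might look as follows. The triples, the scores, and the threshold value are invented for illustration and do not come from the paper.

```python
def filter_with_critic(candidates, critic, threshold=0.5):
    """Keep only the candidates the critic scores at or above `threshold`.

    `critic` is any callable mapping a candidate to a score in [0, 1];
    here it stands in for a separately trained classifier.
    """
    return [c for c in candidates if critic(c) >= threshold]

# Hypothetical teacher generations as (event, relation, inference) triples.
generated = [
    ("X pays for coffee", "xEffect", "X has less money"),
    ("X pays for coffee", "xEffect", "X grows wings"),
    ("X misses the bus", "xReact", "X feels frustrated"),
]

# Stand-in critic: a lookup table of made-up acceptability scores.
toy_scores = {
    generated[0]: 0.93,
    generated[1]: 0.04,
    generated[2]: 0.88,
}

def toy_critic(triple):
    return toy_scores[triple]

kept = filter_with_critic(generated, toy_critic, threshold=0.5)
```

Raising the threshold trades quantity for quality: fewer triples survive, but the ones that do are those the critic is most confident about.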
In Search of the Long-Tail: Systematic Generation of Long-Tail Knowledge via Logical Rule Guided Search
Since large language models have approached human-level performance on many
tasks, it has become increasingly hard for researchers to find tasks that are
still challenging to the models. Failure cases usually come from the long-tail
distribution: data to which an oracle language model would assign a probability
at the lower end of its distribution. Current methodologies such as prompt
engineering or crowdsourcing are insufficient for creating long-tail examples
because humans are constrained by cognitive bias. We propose a
Logic-Induced-Knowledge-Search (LINK) framework for systematically generating
long-tail knowledge statements. Grounded by a symbolic rule, we search for
long-tail values for each variable of the rule by first prompting an LLM, then
verifying the correctness of the values with a critic, and lastly pushing for
the long-tail distribution with a reranker. With this framework we construct a
dataset, Logic-Induced-Long-Tail (LINT), consisting of 200 symbolic rules and
50K knowledge statements spanning across four domains. Human annotations find
that 84% of the statements in LINT are factually correct. In contrast, ChatGPT
and GPT4 struggle with directly generating long-tail statements under the
guidance of logic rules, getting only 56% and 78% of their statements
correct, respectively. Moreover, their "long-tail" generations in fact fall into the higher
likelihood range, and thus are not really long-tail. Our findings suggest that
LINK is effective for generating data in the long-tail distribution while
enforcing quality. LINT can be useful for systematically evaluating LLMs'
capabilities in the long-tail distribution. We challenge the models with a
simple entailment classification task using samples from LINT. We find that
ChatGPT and GPT4's capability in identifying incorrect knowledge drops by ~3% in
the long-tail distribution compared to the head distribution.
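The three-stage search described in the LINK abstract (prompt an LLM for candidate values, verify them with a critic, then push toward the long tail with a reranker) can be sketched roughly as below. All three components are stand-in callables, and ranking by ascending likelihood is an assumption about how "pushing for the long-tail distribution" could be realized; the abstract does not specify the reranker. The candidate values and likelihood numbers are made up.

```python
def link_search(propose, is_correct, likelihood, top_k=2):
    """Sketch of a propose -> verify -> rerank search for one rule variable.

    propose:    callable returning candidate values (stands in for an LLM).
    is_correct: callable returning True if the critic accepts a value.
    likelihood: callable returning a model likelihood for a value;
                lower values are treated as more long-tail (an assumption).
    """
    candidates = propose()
    verified = [value for value in candidates if is_correct(value)]
    # Rerank so the least likely (most long-tail) verified values come first.
    reranked = sorted(verified, key=likelihood)
    return reranked[:top_k]

# Toy setup: fill the variable "a musical instrument" in some symbolic rule.
def toy_propose():
    return ["guitar", "piano", "theremin", "hurdy-gurdy", "hammer"]

def toy_critic(value):
    return value != "hammer"  # the critic rejects the incorrect value

toy_likelihood = {
    "guitar": 0.40,
    "piano": 0.35,
    "theremin": 0.02,
    "hurdy-gurdy": 0.01,
    "hammer": 0.22,
}

long_tail_values = link_search(toy_propose, toy_critic, toy_likelihood.get, top_k=2)
```

Running the critic before the reranker matters in this sketch: without verification, low-likelihood candidates would include many that are rare simply because they are wrong.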
Tidal variation shaped microplastic enrichment patterns in mangrove blue carbon ecosystem of northern Beibu Gulf, China
Mangroves are considered to be a sink for microplastics (MPs) due to their unique characteristics. Previous studies mainly focused on the spatial distribution of MPs, but few researchers have addressed the influence of tidal variation on this distribution, and the total number of MPs in mangroves has remained unknown. In this study, surface sediment samples were collected in mangroves from the Beibu Gulf, South China Sea, and the abundance, composition, and number of MPs were investigated. The results showed that MPs were widely present in all mangrove sediment samples, with abundances ranging from 26.67 ± 9.43 to 239.94 ± 37.80 items/kg. The distribution of MPs was heterogeneous among different sampling sites, with the highest levels in the Shankou (SK) area. The MP abundance in the same mangrove forest gradually increased from the low tidal zone to the high tidal zone, with the enrichment factor ranging from 1.50 to 4.00. The MP abundance was significantly correlated with particulate organic carbon (POC) (n = 12, R = 0.664, p < 0.05). Results showed that mangroves had an interception effect on MPs and that factors affecting MP distribution in mangrove sediments included not only tides but also human activities, such as aquaculture, agriculture, and residential life. Finally, this paper estimated the total number of MPs in mangroves at different sampling areas and tidal zones, and the middle tidal zone was considered to be more accurate for MP pollution assessment in mangroves.
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
Human values are crucial to human decision-making. Value pluralism is the
view that multiple correct values may be held in tension with one another
(e.g., when considering lying to a friend to protect their feelings, how does
one balance honesty with friendship?). As statistical learners, AI systems fit
to averages by default, washing out these potentially irreducible value
conflicts. To improve AI systems to better reflect value pluralism, the
first-order challenge is to explore the extent to which AI systems can model
pluralistic human values, rights, and duties as well as their interaction.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and
duties connected to 31k human-written situations. ValuePrism's contextualized
values are generated by GPT-4 and deemed high-quality by human annotators 91%
of the time. We conduct a large-scale study with annotators across diverse
social and demographic backgrounds to try to understand whose values are
represented.
With ValuePrism, we build Kaleido, an open, lightweight, and structured
language-based multi-task model that generates, explains, and assesses the
relevance and valence (i.e., support or oppose) of human values, rights, and
duties within a specific context. Humans prefer the sets of values output by
our system over the teacher GPT-4, finding them more accurate and with broader
coverage. In addition, we demonstrate that Kaleido can help explain variability
in human decision-making by outputting contrasting values. Finally, we show
that Kaleido's representations transfer to other philosophical frameworks and
datasets, confirming the benefit of an explicit, modular, and interpretable
approach to value pluralism. We hope that our work will serve as a step toward
making more explicit the implicit values behind human decision-making and to
steering AI systems to make decisions that are more in accordance with them.
SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
Data scarcity has been a long-standing issue in the field of open-domain
social dialogue. To quench this thirst, we present SODA: the first publicly
available, million-scale high-quality social dialogue dataset. By
contextualizing social commonsense knowledge from a knowledge graph, we are
able to distill an exceptionally broad spectrum of social interactions from a
large language model. Human evaluation shows that conversations in SODA are
more consistent, specific, and (surprisingly) natural than those in prior
human-authored datasets.
Using SODA, we train COSMO: a generalizable conversation model that is
significantly more natural and consistent on unseen datasets than
best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna).
Experiments reveal COSMO is sometimes even preferred to the original
human-written gold responses. Additionally, our results shed light on the
distinction between knowledge-enriched conversations and natural social
chitchats. We plan to make our data, model, and code public.
Comment: EMNLP 2023. Dataset, model, and code can be found at
https://hyunw.kim/sodavers
Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning
Large language models excel at a variety of language tasks when prompted with
examples or instructions. Yet controlling these models through prompting alone
is limited. Tailoring language models through fine-tuning (e.g., via
reinforcement learning) can be effective, but it is expensive and requires
model access.
We propose Inference-time Policy Adapters (IPA), which efficiently tailors a
language model such as GPT-3 without fine-tuning it. IPA guides a large base
model during decoding time through a lightweight policy adapter trained to
optimize an arbitrary user objective with reinforcement learning.
On five challenging text generation tasks, such as toxicity reduction and
open-domain generation, IPA consistently brings significant improvements over
off-the-shelf language models. It outperforms competitive baseline methods,
sometimes even including expensive fine-tuning. In particular, tailoring GPT-2
with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major
performance boost over GPT-3 (and sometimes even over GPT-4). Our promising
results highlight the potential of IPA as a lightweight alternative to
tailoring extreme-scale language models.