Weakly Supervised Content Selection for Improved Image Captioning
Image captioning involves identifying semantic concepts in the scene and
describing them in fluent natural language. Recent approaches do not explicitly
model the semantic concepts and train the model only for the end goal of
caption generation. Such models lack interpretability and controllability,
primarily due to sub-optimal content selection. We address this problem by
breaking down the captioning task into two simpler, manageable and more
controllable tasks -- skeleton prediction and skeleton-based caption
generation. We approach the former as a weakly supervised task, using a simple
off-the-shelf language syntax parser and avoiding the need for additional human
annotations; the latter uses a supervised-learning approach. We investigate
three methods of conditioning the caption on the skeleton: in the encoder, in
the decoder, and in both. Our compositional model generates significantly
better-quality captions on out-of-domain test images, as judged by human
annotators. Additionally, we demonstrate that the English skeleton transfers
effectively to other languages, including French, Italian, German, Spanish, and
Hindi. This compositional nature of captioning exhibits the potential of
unpaired image captioning, thereby reducing the dependence on expensive
image-caption pairs. Furthermore, we investigate the use of skeletons as a knob
to control certain properties of the generated image caption, such as length,
content, and gender expression.
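
A minimal sketch of the two-stage idea above, assuming the skeleton is the set
of content words (nouns and verbs) extracted by an off-the-shelf parser; the
toy POS table and the generate_from_skeleton interface are illustrative
placeholders, not the paper's model:

# Illustrative sketch (not the paper's code): derive a "skeleton" from a
# reference caption via weak supervision, then condition generation on it.

TOY_POS = {  # stand-in for a real syntax parser such as spaCy or Stanza
    "dog": "NOUN", "frisbee": "NOUN", "grass": "NOUN",
    "catches": "VERB", "jumps": "VERB",
    "a": "DET", "the": "DET", "on": "ADP", "brown": "ADJ",
}

def extract_skeleton(caption: str) -> list[str]:
    """Weak supervision: keep content words (nouns/verbs) as the skeleton."""
    tokens = caption.lower().strip(".").split()
    return [t for t in tokens if TOY_POS.get(t) in {"NOUN", "VERB"}]

def generate_from_skeleton(skeleton: list[str]) -> str:
    """Placeholder for the supervised skeleton-to-caption decoder."""
    # A real model would condition the encoder and/or decoder on the skeleton;
    # here we only show the interface.
    return "a caption realizing: " + " ".join(skeleton)

caption = "A brown dog catches a frisbee on the grass."
skeleton = extract_skeleton(caption)   # ['dog', 'catches', 'frisbee', 'grass']
print(generate_from_skeleton(skeleton))
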
Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
We introduce Lumos, a novel framework for training language agents that
employs a unified data format and a modular architecture based on open-source
large language models (LLMs). Lumos consists of three distinct modules:
planning, grounding, and execution. The planning module breaks down a task into
a series of high-level, tool-agnostic subgoals, which are then made specific by
the grounding module through a set of low-level actions. These actions are
subsequently executed by the execution module, utilizing a range of
off-the-shelf tools and APIs. To train these modules effectively, we collected
high-quality annotations of subgoals and actions and make them available for
fine-tuning open-source LLMs on various tasks such as complex
question answering, web tasks, and math problems. Leveraging this unified data
and modular design, Lumos not only achieves comparable or superior performance
to current state-of-the-art agents, but also exhibits several key advantages:
(1) Lumos surpasses GPT-4/3.5-based agents in complex question answering and
web tasks, while equalling the performance of significantly larger LLM agents
on math tasks; (2) Lumos outperforms open-source agents created through
conventional training methods and those using chain-of-thoughts training; and
(3) Lumos is capable of effectively generalizing to unseen interactive tasks,
outperforming larger LLM-based agents and even exceeding the performance of
specialized agents. (Project website: https://allenai.github.io/lumos)
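
A minimal sketch of the planning -> grounding -> execution loop described
above; the module internals, the tool name KnowledgeQuery, and the
subgoal/action formats are assumptions for clarity, whereas Lumos fine-tunes
open-source LLMs for the planning and grounding modules:

from typing import Callable

def plan(task: str) -> list[str]:
    """Planning module: break a task into high-level, tool-agnostic subgoals."""
    return [f"Subgoal 1: find facts relevant to '{task}'",
            "Subgoal 2: combine the facts into an answer"]

def ground(subgoal: str) -> list[tuple[str, str]]:
    """Grounding module: turn a subgoal into low-level (tool, argument) actions."""
    return [("KnowledgeQuery", subgoal)]

TOOLS: dict[str, Callable[[str], str]] = {
    # Execution module: off-the-shelf tools/APIs keyed by action name.
    "KnowledgeQuery": lambda arg: f"<result for {arg!r}>",
}

def run_agent(task: str) -> list[str]:
    observations = []
    for subgoal in plan(task):
        for tool_name, arg in ground(subgoal):
            observations.append(TOOLS[tool_name](arg))
    return observations

print(run_agent("Who directed the film adapted from the 1954 novel?"))
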
Continual Dialogue State Tracking via Example-Guided Question Answering
Dialogue systems are frequently updated to accommodate new services, but
naively updating them by continually training with data for new services
results in diminished performance on previously learnt services. Motivated by the insight
that dialogue state tracking (DST), a crucial component of dialogue systems
that estimates the user's goal as a conversation proceeds, is a simple natural
language understanding task, we propose reformulating it as a bundle of
granular example-guided question answering tasks to minimize the task shift
between services and thus benefit continual learning. Our approach alleviates
service-specific memorization and teaches a model to contextualize the given
question and example to extract the necessary information from the
conversation. We find that a model with just 60M parameters can achieve a
significant boost by learning to learn from in-context examples retrieved by a
retriever trained to identify turns with similar dialogue state changes.
Combining our method with dialogue-level memory replay, our approach attains
state-of-the-art performance on DST continual learning metrics without relying
on any complex regularization or parameter expansion methods.
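
A minimal sketch of recasting DST as example-guided question answering: each
slot becomes a granular question, paired with a retrieved in-context example of
a turn with a similar state change. The prompt template and field names below
are assumptions, not the paper's exact format:

def build_qa_input(dialogue: list[str], slot_question: str,
                   example_turn: str, example_answer: str) -> str:
    # Prepend one retrieved example, then ask the same question about the
    # current conversation; the model extracts the answer span from context.
    return (
        "Example turn: " + example_turn + "\n"
        "Example question: " + slot_question + "\n"
        "Example answer: " + example_answer + "\n\n"
        "Conversation:\n" + "\n".join(dialogue) + "\n"
        "Question: " + slot_question + "\n"
        "Answer:"
    )

dialogue = ["User: I need a taxi to the airport at 5pm.",
            "System: Sure, where should it pick you up?"]
prompt = build_qa_input(
    dialogue,
    slot_question="What is the destination of the taxi?",
    example_turn="User: Book me a taxi to the museum.",
    example_answer="the museum",
)
print(prompt)  # one such input per slot, fed to a small (~60M parameter) model
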
Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness
The ability to acknowledge the inevitable uncertainty in their knowledge and
reasoning is a prerequisite for AI systems to be truly truthful and reliable.
In this paper, we present a taxonomy of uncertainty specific to vision-language
AI systems, distinguishing between epistemic uncertainty (arising from a lack
of information) and aleatoric uncertainty (due to inherent unpredictability),
and further explore finer categories within. Based on this taxonomy, we
synthesize a benchmark dataset, CertainlyUncertain, featuring 178K visual
question answering (VQA) samples as contrastive pairs. This is achieved by 1)
inpainting images to turn previously answerable questions into unanswerable
ones; and 2) using image captions to prompt large language models for both
answerable and unanswerable questions. Additionally, we introduce a new metric,
confidence-weighted accuracy, which is well correlated with both accuracy and
calibration error, to address the shortcomings of existing metrics.
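
The abstract does not give the exact formula for confidence-weighted accuracy,
so the sketch below is only one plausible instantiation for illustration: each
answer's correctness is weighted by the model's stated confidence, and
confidently wrong answers are penalized, so the score tracks both accuracy and
calibration:

def confidence_weighted_accuracy(correct: list[bool],
                                 confidence: list[float]) -> float:
    """correct[i]: whether answer i is right; confidence[i] in [0, 1]."""
    assert len(correct) == len(confidence) and correct
    total = 0.0
    for ok, c in zip(correct, confidence):
        total += c if ok else (1.0 - c)  # reward confident hits, penalize confident misses
    return total / len(correct)

# An accurate, well-calibrated model scores high; an overconfident wrong one does not.
print(confidence_weighted_accuracy([True, True, False], [0.9, 0.6, 0.8]))  # ~0.567
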