Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification
We investigate the problem of reducing mistake severity for fine-grained
classification. Fine-grained classification can be challenging, mainly due to
the requirement of domain expertise for accurate annotation. However, humans
are particularly adept at performing coarse classification as it requires
relatively low levels of expertise. To this end, we present a novel approach
for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label
hierarchy to improve the performance of fine-grained classification at
test-time using the coarse-grained predictions. By only requiring the parents
of leaf nodes, our method significantly reduces average mistake severity while
improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets,
achieving a new state-of-the-art on both benchmarks. We also investigate the
efficacy of our approach in the semi-supervised setting. Our approach brings
notable gains in top-1 accuracy while significantly decreasing the severity of
mistakes as training data decreases for the fine-grained classes. The
simplicity and post-hoc nature of HiE renders it practical to be used with any
off-the-shelf trained model to improve its predictions further.
Comment: 8 pages, 2 figures, 3 tables, Accepted at NeurIPS 202
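As a rough illustration of the post-hoc idea (not necessarily the paper's exact formulation), one simple way to combine a coarse classifier with a fine-grained one at test time is to reweight each leaf-class probability by the predicted probability of its parent and renormalize. The sketch below assumes softmax outputs from both models and a known leaf-to-parent mapping; all names are illustrative.

```python
import numpy as np

def hierarchical_ensemble(fine_probs, coarse_probs, parent_of):
    """Reweight leaf-class probabilities by the probability of each leaf's parent.

    fine_probs:   (num_leaves,) softmax output of the fine-grained classifier
    coarse_probs: (num_parents,) softmax output of the coarse classifier
    parent_of:    (num_leaves,) index of each leaf's parent in the coarse label set
    """
    weighted = fine_probs * coarse_probs[parent_of]
    return weighted / weighted.sum()

# Toy example: 4 leaves grouped under 2 parents.
fine = np.array([0.40, 0.35, 0.15, 0.10])
coarse = np.array([0.2, 0.8])           # coarse model is confident in parent 1
parents = np.array([0, 0, 1, 1])
print(hierarchical_ensemble(fine, coarse, parents))  # mass shifts to leaves 2 and 3
```

In this toy case the coarse prediction overturns the fine-grained model's top-1 choice while keeping the ranking within each parent intact, which is the kind of correction that reduces mistake severity.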
Strategic Reasoning with Language Models
Strategic reasoning enables agents to cooperate, communicate, and compete
with other agents in diverse situations. Existing approaches to solving
strategic games rely on extensive training, yielding strategies that do not
generalize to new scenarios or games without retraining. Large Language Models
(LLMs), with their ability to comprehend and generate complex, context-rich
language, could prove powerful as tools for strategic gameplay. This paper
introduces an approach that uses pretrained LLMs with few-shot chain-of-thought
examples to enable strategic reasoning for AI agents. Our approach uses
systematically generated demonstrations of reasoning about states, values, and
beliefs to prompt the model. Using extensive variations of simple matrix games,
we show that strategies derived from systematically generated
prompts generalize almost perfectly to new game structures, alternate
objectives, and hidden information. Additionally, we demonstrate our approach
can lead to human-like negotiation strategies in realistic scenarios without
any extra training or fine-tuning. Our results highlight the ability of LLMs,
guided by systematic reasoning demonstrations, to adapt and excel in diverse
strategic scenarios.
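A minimal sketch of how such a few-shot chain-of-thought prompt might be assembled for a simple 2x2 matrix game; the payoff values, demonstration wording, and function name are illustrative assumptions rather than the paper's actual prompts.

```python
def matrix_game_prompt(payoffs, demonstration):
    """Assemble a few-shot prompt that reasons about states, values, and the other player."""
    rows = "\n".join(
        f"  If you play {a} and they play {b}: you get {u}, they get {v}"
        for (a, b), (u, v) in payoffs.items()
    )
    return (
        f"{demonstration}\n\n"
        "New game:\n"
        f"{rows}\n"
        "Reason step by step about the value of each action and the other "
        "player's likely choice, then state your action.\n"
    )

demo = (
    "Example game:\n"
    "  If you play A and they play A: you get 3, they get 3\n"
    "  If you play A and they play B: you get 0, they get 5\n"
    "  If you play B and they play A: you get 5, they get 0\n"
    "  If you play B and they play B: you get 1, they get 1\n"
    "Reasoning: Whatever the other player does, B gives me more than A, "
    "so B is my best action.\n"
    "Action: B"
)

payoffs = {("A", "A"): (4, 4), ("A", "B"): (1, 3), ("B", "A"): (3, 1), ("B", "B"): (2, 2)}
print(matrix_game_prompt(payoffs, demo))
```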
Understanding Social Reasoning in Language Models with Language Models
As Large Language Models (LLMs) become increasingly integrated into our
everyday lives, understanding their ability to comprehend human mental states
becomes critical for ensuring effective interactions. However, despite the
recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of
LLMs, the degree to which these models can align with human ToM remains a
nuanced topic of exploration. This is primarily due to two distinct challenges:
(1) the presence of inconsistent results from previous evaluations, and (2)
concerns surrounding the validity of existing evaluation methodologies. To
address these challenges, we present a novel framework for procedurally
generating evaluations with LLMs by populating causal templates. Using our
framework, we create a new social reasoning benchmark (BigToM) for LLMs which
consists of 25 controls and 5,000 model-written evaluations. We find that human
participants rate the quality of our benchmark higher than previous
crowd-sourced evaluations and comparable to expert-written evaluations. Using
BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and
compare model performances with human performance. Our results suggest that
GPT-4 has ToM capabilities that mirror human inference patterns, though they are less reliable, while other LLMs struggle.
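A minimal sketch of populating a causal template to produce one evaluation item; the template wording, slot fillers, and false-belief structure shown here are illustrative assumptions, not items from BigToM.

```python
import random

# A single causal template: an agent forms a belief, the world changes without
# the agent observing it, and we ask what the agent will believe.
TEMPLATE = (
    "{agent} puts the {object} in the {location_a}. "
    "While {agent} is away, {other} moves the {object} to the {location_b}. "
    "{agent} did not see this. Where does {agent} think the {object} is?"
)

FILLERS = {
    "agent": ["Noor", "Sam"],
    "other": ["a coworker", "a sibling"],
    "object": ["notebook", "key"],
    "location_a": ["drawer", "backpack"],
    "location_b": ["shelf", "coat pocket"],
}

def sample_item(rng):
    values = {slot: rng.choice(options) for slot, options in FILLERS.items()}
    return {"question": TEMPLATE.format(**values), "answer": values["location_a"]}

print(sample_item(random.Random(0)))
```

Because every item is generated from the same causal structure, matched control items (e.g., where the agent does observe the change) can be produced by editing a single clause of the template.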
Stream of Search (SoS): Learning to Search in Language
Language models are rarely shown fruitful mistakes during training. They then
struggle to look beyond the next token, suffering from a snowballing of errors
and failing to predict the consequences of their actions several steps ahead.
In this paper, we show how language models can be taught to search by
representing the process of search in language, as a flattened string -- a
stream of search (SoS). We propose a unified language for search that captures
an array of different symbolic search strategies. We demonstrate our approach
using the simple yet difficult game of Countdown, where the goal is to combine
input numbers with arithmetic operations to reach a target number. We pretrain
a transformer-based language model from scratch on a dataset of streams of
search generated by heuristic solvers. We find that SoS pretraining increases
search accuracy by 25% over models trained to predict only the optimal search
trajectory. We further finetune this model with two policy improvement methods:
Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The
finetuned SoS models solve 36% of previously unsolved problems, including
problems that cannot be solved by any of the heuristic solvers. Our results
indicate that language models can learn to solve problems via search,
self-improve to flexibly use different search strategies, and potentially
discover new ones.
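A minimal sketch of how a Countdown search could be serialized as one flat string: a depth-first solver logs every state, candidate operation, and backtrack into a single trace. The state/try/backtrack vocabulary is an illustrative assumption, not the paper's trace format, and division is omitted for brevity.

```python
from itertools import combinations

def countdown_stream(numbers, target, trace):
    """Depth-first search over Countdown states, appending every step to `trace`."""
    trace.append(f"state {sorted(numbers)} target {target}")
    if target in numbers:
        trace.append("solved")
        return True
    for a, b in combinations(numbers, 2):
        rest = list(numbers)
        rest.remove(a)
        rest.remove(b)
        for expr, value in ((f"{a}+{b}", a + b),
                            (f"{a}*{b}", a * b),
                            (f"|{a}-{b}|", abs(a - b))):
            trace.append(f"try {expr}={value}")
            if countdown_stream(rest + [value], target, trace):
                return True
            trace.append("backtrack")
    return False

trace = []
countdown_stream([3, 5, 7], 26, trace)
print(" -> ".join(trace))  # one flattened stream-of-search string
```

Training on strings like this, which include dead ends and backtracking, is what distinguishes stream-of-search data from traces that record only the optimal solution path.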
Social Contract AI: Aligning AI Assistants with Implicit Group Norms
We explore the idea of aligning an AI assistant by inverting a model of
users' (unknown) preferences from observed interactions. To validate our
proposal, we run proof-of-concept simulations in the economic ultimatum game,
formalizing user preferences as policies that guide the actions of simulated
players. We find that the AI assistant accurately aligns its behavior to match
standard policies from the economic literature (e.g., selfish, altruistic).
However, the assistant's learned policies lack robustness and exhibit limited
generalization in an out-of-distribution setting when confronted with a
currency (e.g., grams of medicine) that was not included in the assistant's
training distribution. Additionally, we find that when there is inconsistency
in the relationship between language use and an unknown policy (e.g., an
altruistic policy combined with rude language), the assistant's learning of the
policy is slowed. Overall, our preliminary results suggest that developing
simulation frameworks in which AI assistants need to infer preferences from
diverse users can provide a valuable approach for studying practical alignment
questions.
Comment: SoLaR NeurIPS 2023 Workshop (https://solar-neurips.github.io/)
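A minimal sketch of the inference problem in an ultimatum-game setting: a simulated user proposes splits according to a latent policy, and the assistant picks the policy that best explains the observed offers. The specific offer ranges and the matching score are illustrative assumptions, not the paper's simulation design.

```python
import random

def proposer_offer(policy, rng, total=10):
    """Simulated user proposing a split of `total` according to a latent policy."""
    if policy == "selfish":
        return rng.randint(0, 2)       # keeps most of the pie (illustrative range)
    if policy == "altruistic":
        return rng.randint(5, total)   # gives away at least half
    raise ValueError(policy)

def infer_policy(offers, rng, samples=200):
    """Pick the candidate policy whose simulated offers best match the observed ones."""
    def match_score(policy):
        sims = [proposer_offer(policy, rng) for _ in range(samples)]
        return sum(sims.count(o) for o in offers)
    return max(["selfish", "altruistic"], key=match_score)

rng = random.Random(0)
observed = [proposer_offer("altruistic", rng) for _ in range(10)]
print(infer_policy(observed, rng))  # recovers "altruistic" from observed behavior
```

The out-of-distribution failure described above corresponds to changing the currency of the splits (e.g., grams of medicine) so that the observed offers no longer resemble anything the assistant saw during training.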
Instance-Level Semantic Maps for Vision Language Navigation
Humans have a natural ability to perform semantic associations with the
surrounding objects in the environment. This allows them to create a mental map
of the environment which helps them to navigate on-demand when given a
linguistic instruction. A natural goal in Vision Language Navigation (VLN)
research is to impart autonomous agents with similar capabilities. Recently
introduced VL Maps (Huang et al., 2023) take a step towards this goal by
creating a semantic spatial map representation of the environment without any
labelled data. However, their representations are limited for practical
applicability as they do not distinguish between different instances of the
same object. In this work, we address this limitation by integrating
instance-level information into spatial map representation using a community
detection algorithm and by utilizing word ontology learned by large language
models (LLMs) to perform open-set semantic associations in the mapping
representation. The resulting map representation improves the navigation
performance by two-fold (233%) on realistic language commands with
instance-specific descriptions compared to VL Maps. We validate the
practicality and effectiveness of our approach through extensive qualitative
and quantitative experiments.
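A minimal sketch of the instance-splitting step, under the assumption that map points sharing a semantic class can be linked by spatial proximity and grouped with an off-the-shelf community-detection routine (here networkx's greedy modularity communities, which may differ from the paper's algorithm).

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def split_into_instances(points, radius=1.0):
    """Group same-class map points into object instances via community detection.

    points: list of (x, y) coordinates already labelled with one semantic class,
            e.g. every map cell tagged as "chair" by the open-vocabulary mapper.
    """
    graph = nx.Graph()
    graph.add_nodes_from(range(len(points)))
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points[i + 1:], start=i + 1):
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                graph.add_edge(i, j)
    # Each community of nearby points is treated as one object instance.
    return [sorted(c) for c in greedy_modularity_communities(graph)]

chairs = [(0.0, 0.0), (0.2, 0.1), (0.4, 0.0),   # first chair
          (5.0, 5.0), (5.3, 4.9)]               # second chair
print(split_into_instances(chairs))
```

With instances separated this way, commands such as "go to the second chair on the left" can be grounded to a specific object rather than to every region of the class-level map.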
Evaluating infants’ reasoning about agents using the Baby Intuitions Benchmark (BIB)
Young infants reason about the goals, preferences, and actions of others. State-of-the-art computational models, however, still fail at such reasoning. The Baby Intuitions Benchmark (BIB) was designed to test agency reasoning in AI using an infant behavioral paradigm. While BIB’s presentation of simple animations makes it particularly suitable for testing AI, such vignettes have yet to be validated with infants.
In this pilot, 11-month-old infants watched two sets of animations from BIB, one on agents’ consistent preferences and the other on agents’ efficient actions. Infants looked longer towards violations in agents’ behavior in both the preference task (N = 24, β = 3.24, p = .040) and the efficiency task (N = 24, β = 4.50, p = .016).
These preliminary results suggest that infants’ agency reasoning is abstract enough to be elicited by simple animations and validate BIB as a test of agency reasoning for humans and AIs.