MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition
Humans have the ability to learn novel compositional concepts by recalling
and generalizing primitive concepts acquired from past experiences. Inspired by
this observation, in this paper, we propose MetaReVision, a retrieval-enhanced
meta-learning model to address the visually grounded compositional concept
learning problem. The proposed MetaReVision consists of a retrieval module and
a meta-learning module, which are designed to incorporate retrieved primitive
concepts as a support set to meta-train vision-language models for grounded
compositional concept recognition. Through meta-learning from episodes
constructed by the retriever, MetaReVision learns a generic compositional
representation that can be rapidly updated to recognize novel compositional
concepts. We create CompCOCO and CompFlickr to benchmark grounded
compositional concept learning. Our experimental results show that MetaReVision
outperforms competitive baselines and that the retrieval module plays an
important role in this compositional learning process.
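As a rough illustration of the retrieve-then-meta-train loop described above, here is a minimal first-order sketch in PyTorch. The linear classifier, the retrieval index, the label map, and every name below are hypothetical stand-ins for the paper's VLM and retriever, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a linear "VLM" head and a retrieval index mapping each
# primitive concept to cached example features. All names are hypothetical.
model = nn.Linear(16, 4)
index = {"red": [torch.randn(16)], "car": [torch.randn(16)]}
labels = {"red": 0, "car": 1}

def episode(attr, obj, query_x, query_y, inner_lr=0.1):
    """One retrieval-built episode: adapt on retrieved primitives (support),
    then score the adapted weights on the novel composition (query)."""
    support = [(x, labels[attr]) for x in index[attr]] + \
              [(x, labels[obj]) for x in index[obj]]
    loss_fn = nn.CrossEntropyLoss()
    # Inner step: fast adaptation on the support set (first-order, for brevity).
    inner = sum(loss_fn(model(x).unsqueeze(0), torch.tensor([y]))
                for x, y in support)
    grads = torch.autograd.grad(inner, list(model.parameters()))
    fast_w, fast_b = (p - inner_lr * g
                      for p, g in zip(model.parameters(), grads))
    # Outer step: meta-loss of the adapted weights on the query composition.
    logits = F.linear(query_x, fast_w, fast_b).unsqueeze(0)
    return loss_fn(logits, torch.tensor([query_y]))

meta_loss = episode("red", "car", torch.randn(16), 2)
meta_loss.backward()  # meta-gradient flows back into the shared weights
```

The point of the episode structure is that the outer loss is computed on a composition whose primitives, but not the composition itself, appear in the retrieved support set.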
GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning
Pre-trained vision-language models (VLMs) have achieved promising success in
many fields, especially with the prompt-learning paradigm. In this work, we
propose GIPCOL (Graph-Injected Soft Prompting for COmpositional Learning) to
better explore the compositional zero-shot learning (CZSL) ability of VLMs
within the prompt-based learning framework. The soft prompt in GIPCOL is
structured and consists of learnable prefix vectors, an attribute label, and
an object label. In addition, the attribute and object labels in the soft
prompt are designated as nodes in a compositional graph. The compositional
graph is constructed from the compositional structure of the objects and
attributes extracted from the training data, and the updated concept
representations are then fed back into the soft prompt so that it captures
this compositional structure for better CZSL prompting. With this new
prompting strategy, GIPCOL achieves state-of-the-art AUC results on all three
CZSL benchmarks (MIT-States, UT-Zappos, and C-GQA) in both closed and open
settings, compared to previous non-CLIP as well as CLIP-based methods. We
analyze when and why GIPCOL works well given the CLIP backbone and its
training data limitations, and our findings shed light on designing more
effective prompts for CZSL.
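A minimal sketch of the structured soft prompt described above, assuming toy dimensions and a single linear message-passing layer; the node set, adjacency, and layer here are illustrative placeholders rather than GIPCOL's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative sizes and names; not GIPCOL's actual components.
d = 32
prefix = nn.Parameter(torch.randn(3, d))   # learnable prefix vectors
nodes = nn.Parameter(torch.randn(4, d))    # graph nodes: "red", "old", "car", "shoe"

# Compositional graph built from attribute-object pairs seen in training
# (here: red-car, old-shoe), with self-loops on the diagonal.
A = torch.eye(4)
A[0, 2] = A[2, 0] = 1.0                    # "red" <-> "car"
A[1, 3] = A[3, 1] = 1.0                    # "old" <-> "shoe"
gnn = nn.Linear(d, d)                      # one toy message-passing layer

def soft_prompt(attr_idx, obj_idx):
    """Inject graph-updated concept states into the prompt for one pair."""
    h = torch.relu(gnn(A @ nodes))         # propagate over the compositional graph
    return torch.cat([prefix, h[attr_idx:attr_idx + 1], h[obj_idx:obj_idx + 1]])

prompt = soft_prompt(0, 2)                 # structured prompt for "red car", (5, d)
```

In the full method this prompt would be consumed by a CLIP-style text encoder and scored against image features; only the prompt construction is shown here.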
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models
Large Language Models (LLMs) have generated considerable interest and debate
regarding their potential emergence of Theory of Mind (ToM). Several recent
inquiries reveal a lack of robust ToM in these models and pose a pressing
demand to develop new benchmarks, as current ones primarily focus on different
aspects of ToM and are prone to shortcuts and data leakage. In this position
paper, we seek to answer two road-blocking questions: (1) How can we taxonomize
a holistic landscape of machine ToM? (2) What is a more effective evaluation
protocol for machine ToM? Following psychological studies, we taxonomize
machine ToM into 7 mental state categories and delineate existing benchmarks to
identify under-explored aspects of ToM. We argue for a holistic and situated
evaluation of ToM that breaks ToM into individual components and treats LLMs
as agents that are physically situated in environments and socially situated
in interactions with humans. Such situated evaluation provides a more
comprehensive assessment of mental states and potentially mitigates the risk of
shortcuts and data leakage. We further present a pilot study in a grid world
setup as a proof of concept. We hope this position paper can facilitate future
research to integrate ToM with LLMs and offer an intuitive means for
researchers to better position their work in the landscape of ToM. Project
page: https://github.com/Mars-tin/awesome-theory-of-mind
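To make the situated-evaluation idea concrete, here is a purely illustrative toy probe in the spirit of a grid-world false-belief test; it is not the paper's pilot study, and all names and strings below are hypothetical.

```python
# An agent's belief about an object's location diverges from reality after a
# move the agent did not observe; the probe checks whether an LLM-as-agent
# answers from the agent's belief state rather than the world state.
world = {"key": "drawer"}            # ground-truth location
belief = {"key": "drawer"}           # the agent saw the key placed in the drawer
world["key"] = "shelf"               # the key is moved while the agent is away

question = (
    "You last saw the key in the {}. You were away when anything else "
    "happened. Where do you believe the key is?".format(belief["key"])
)
# A situated evaluation would send `question` to the model and check that the
# answer tracks the agent's belief ("drawer"), not the world state ("shelf").
print(question)
```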
NLP Reproducibility For All: Understanding Experiences of Beginners
As natural language processing (NLP) has recently seen an unprecedented level
of excitement and more people are eager to enter the field, it is unclear
whether current research reproducibility efforts are sufficient for this group
of beginners to apply the latest developments. To understand their needs, we
conducted a study with 93 students in an introductory NLP course, where
students reproduced the results of recent NLP papers. Surprisingly, we find
that their programming skill and comprehension of research papers have a
limited impact on their effort spent completing the exercise. Instead, we find
accessibility efforts by research authors to be the key to success, including
complete documentation, better coding practice, and easier access to data
files. Going forward, we recommend that NLP researchers pay close attention to
these simple aspects of open-sourcing their work, and use insights from
beginners' feedback to provide actionable ideas on how to better support them.
In-Context Analogical Reasoning with Pre-Trained Language Models
Analogical reasoning is a fundamental capacity of human cognition that allows
us to reason abstractly about novel situations by relating them to past
experiences. While it is thought to be essential for robust reasoning in AI
systems, conventional approaches require significant training and/or
hard-coding of domain knowledge to be applied to benchmark tasks. Inspired by
cognitive science research that has found connections between human language
and analogy-making, we explore the use of intuitive language-based abstractions
to support analogy in AI systems. Specifically, we apply large pre-trained
language models (PLMs) to visual Raven's Progressive Matrices (RPM), a common
relational reasoning test. By simply encoding the perceptual features of the
problem into language form, we find that PLMs exhibit a striking capacity for
zero-shot relational reasoning, exceeding human performance and nearing
supervised vision-based methods. We explore different encodings that vary the
level of abstraction over task features, finding that higher-level abstractions
further strengthen PLMs' analogical reasoning. Our detailed analysis reveals
insights on the role of model complexity, in-context learning, and prior
knowledge in solving RPM tasks.
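The key move described above, verbalizing the perceptual features of an RPM problem so a text-only PLM can complete the pattern, can be sketched as follows. The feature names and prompt template are illustrative assumptions, not the paper's exact encodings.

```python
# A 3x3 Raven's-style matrix as symbolic perceptual features; the final cell
# is the one the model must infer. Attributes here are made up for the sketch.
grid = [
    [{"shape": "triangle", "count": 1}, {"shape": "triangle", "count": 2},
     {"shape": "triangle", "count": 3}],
    [{"shape": "square", "count": 1}, {"shape": "square", "count": 2},
     {"shape": "square", "count": 3}],
    [{"shape": "circle", "count": 1}, {"shape": "circle", "count": 2}, None],
]

def describe(cell):
    """Render one cell's perceptual features as natural language."""
    return "?" if cell is None else f"{cell['count']} {cell['shape']}(s)"

prompt = "\n".join(
    "Row {}: ".format(i + 1) + ", ".join(describe(c) for c in row)
    for i, row in enumerate(grid)
) + "\nWhat replaces the '?' cell?"

print(prompt)  # feed this string to a pre-trained LM for zero-shot completion
```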
Efficient In-Context Learning in Vision-Language Models for Egocentric Videos
Recent advancements in text-only large language models (LLMs) have
highlighted the benefit of in-context learning for adapting to new tasks with a
few demonstrations. However, extending in-context learning to large
vision-language models (VLMs) using a huge amount of naturalistic
vision-language data has shown limited success, particularly for egocentric
videos, due to high data collection costs. We propose a novel training method,
Efficient In-context Learning on Egocentric Videos (EILEV), which elicits
in-context learning in VLMs for egocentric videos without requiring massive,
naturalistic egocentric video datasets. EILEV involves architectural
and training data adaptations to allow the model to process contexts
interleaved with video clips and narrations, sampling of in-context examples
with clusters of similar verbs and nouns, and use of data with skewed marginal
distributions with a long tail of infrequent verbs and nouns, as well as
homonyms and synonyms. Our evaluations show that EILEV-trained
models outperform larger VLMs trained on a huge amount of naturalistic data in
in-context learning. Furthermore, they can generalize not only to
out-of-distribution, but also to novel, rare egocentric videos and texts via
in-context learning, demonstrating potential for applications requiring
cost-effective training and rapid post-deployment adaptability. Our code and
demo are available at https://github.com/yukw777/EILEV.
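A minimal sketch of the interleaved-context construction described above, assuming a toy dataset and a hypothetical `<video:...>` placeholder token; the real method operates on clip features inside the VLM rather than on strings, and the sampling below is a simplification of the verb/noun clustering.

```python
import random
from collections import defaultdict

# Toy narrated egocentric clips; fields and values are illustrative only.
dataset = [
    {"clip": "clip_001", "verb": "cut", "noun": "onion",
     "narration": "The camera wearer cuts an onion."},
    {"clip": "clip_002", "verb": "cut", "noun": "carrot",
     "narration": "The camera wearer cuts a carrot."},
    {"clip": "clip_003", "verb": "pour", "noun": "water",
     "narration": "The camera wearer pours water."},
]

# Group examples by verb so demonstrations share the query's action cluster.
by_verb = defaultdict(list)
for ex in dataset:
    by_verb[ex["verb"]].append(ex)

def build_context(query_verb, query_clip, k=2):
    """Interleave demonstration clips with narrations, then append the query."""
    pool = by_verb[query_verb]
    demos = random.sample(pool, min(k, len(pool)))
    parts = [f"<video:{d['clip']}> {d['narration']}" for d in demos]
    parts.append(f"<video:{query_clip}>")   # model continues with a narration
    return " ".join(parts)

print(build_context("cut", "clip_099"))
```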
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks
but lack an intuitive interface for consistent image-to-image (I2I)
translation. Various methods have been explored to address this issue,
including mask-based methods, attention-based methods, and image-conditioning.
However, it remains a critical challenge to enable unpaired I2I translation
with pre-trained DMs while maintaining satisfying consistency. This paper
introduces CycleNet, a novel yet simple method that incorporates cycle
consistency into DMs to regularize image manipulation. We validate CycleNet on
unpaired I2I tasks of different granularities. Besides the scene and object
level translation, we additionally contribute a multi-domain I2I translation
dataset to study the physical state changes of objects. Our empirical studies
show that CycleNet is superior in translation consistency and quality, and can
generate high-quality images for out-of-domain distributions with a simple
change of the textual prompt. CycleNet is a practical framework that is
robust even with very limited training data (around 2k) and requires minimal
computational resources (1 GPU) to train. Project homepage:
https://cyclenetweb.github.io/
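As a rough sketch of the cycle-consistency idea, the following toy code translates an image with a target prompt, translates back with the source prompt, and penalizes the reconstruction error. The `translate` function is a stand-in for a full prompt-conditioned diffusion sampling loop, which is elided here; nothing below is the authors' implementation.

```python
import torch

def translate(x, prompt, strength=0.1):
    """Toy stand-in for prompt-conditioned editing of image tensor x."""
    torch.manual_seed(hash(prompt) % (2 ** 31))  # deterministic per prompt
    return x + strength * torch.randn_like(x)

def cycle_loss(x, src_prompt, tgt_prompt):
    """Cycle-consistency regularizer: x should survive a round trip."""
    forward = translate(x, tgt_prompt)        # e.g. summer -> winter
    backward = translate(forward, src_prompt) # winter -> summer
    return torch.mean((backward - x) ** 2)    # x ~= its back-translation

x = torch.rand(3, 64, 64)                     # a dummy RGB image
print(cycle_loss(x, "a photo in summer", "a photo in winter"))
```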