Dynamic reorganization of the middle fusiform gyrus: long-term bird expertise predicts decreased face selectivity
What is the functional relationship between face-selective and expertise-predicated object-selective regions in the human middle fusiform gyrus? In two separate fMRI experiments, superior behaviorally measured bird expertise predicts both higher middle fusiform gyrus selectivity for birds and, concomitantly, lower selectivity for faces. This finding suggests a long-term dynamic reorganization of the neural mechanisms underlying the visual recognition of faces and non-face objects.
Quantifying the Roles of Visual, Linguistic, and Visual-Linguistic Complexity in Verb Acquisition
Children typically learn the meanings of nouns earlier than the meanings of
verbs. However, it is unclear whether this asymmetry is a result of complexity
in the visual structure of categories in the world to which language refers,
the structure of language itself, or the interplay between the two sources of
information. We quantitatively test these three hypotheses regarding early verb
learning by employing visual and linguistic representations of words sourced
from large-scale pre-trained artificial neural networks. Examining the
structure of both visual and linguistic embedding spaces, we find, first, that
the representation of verbs is generally more variable and less discriminable
within domain than the representation of nouns. Second, we find that if only
one learning instance per category is available, visual and linguistic
representations are less well aligned in the verb system than in the noun
system. However, in parallel with the course of human language development, if
multiple learning instances per category are available, visual and linguistic
representations become almost as well aligned in the verb system as in the noun
system. Third, we compare the relative contributions of factors that may
predict learning difficulty for individual words. A regression analysis reveals
that visual variability is the strongest factor that internally drives verb
learning, followed by visual-linguistic alignment and linguistic variability.
Based on these results, we conclude that verb acquisition is influenced by all
three sources of complexity, but that the variability of visual structure poses
the most significant challenge for verb learning.
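The quantities this abstract describes (within-category variability and visual-linguistic alignment) can be approximated with standard embedding tooling. The following is a minimal sketch, not the authors' code: the embeddings are random placeholders standing in for features from pre-trained visual and linguistic networks, and alignment is computed as a second-order, RSA-style correlation between category distance matrices.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def within_category_variability(instances):
    """Mean pairwise cosine distance among the instances of one category."""
    return pdist(instances, metric="cosine").mean()

def visual_linguistic_alignment(visual_means, linguistic_means):
    """Second-order alignment: correlate the visual and linguistic
    category-by-category distance matrices (Spearman rho)."""
    d_vis = pdist(visual_means, metric="cosine")
    d_lang = pdist(linguistic_means, metric="cosine")
    rho, _ = spearmanr(d_vis, d_lang)
    return rho

# Placeholder data: 50 verb categories, 20 visual instances each (512-d),
# and one 300-d linguistic embedding per category.
rng = np.random.default_rng(0)
visual = [rng.normal(size=(20, 512)) for _ in range(50)]
linguistic = rng.normal(size=(50, 300))

variability = np.array([within_category_variability(v) for v in visual])
visual_means = np.stack([v.mean(axis=0) for v in visual])
print(f"mean visual variability: {variability.mean():.3f}")
print(f"visual-linguistic alignment: {visual_linguistic_alignment(visual_means, linguistic):.3f}")
```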
How are Three-Dimensional Objects Represented in the Brain?
We discuss a variety of object recognition experiments in which human subjects were presented with realistically rendered images of computer-generated three-dimensional objects, with tight control over stimulus shape, surface properties, illumination, and viewpoint, as well as subjects' prior exposure to the stimulus objects. In all experiments, recognition performance was: (1) consistently viewpoint dependent; (2) only partially aided by binocular stereo and other depth information; (3) specific to viewpoints that were familiar; (4) systematically disrupted by rotation in depth more than by deforming the two-dimensional images of the stimuli. These results are consistent with recently advanced computational theories of recognition based on view interpolation.
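The view-interpolation account mentioned at the end can be illustrated with a small radial-basis-function toy model (a sketch in the spirit of that class of theories, not a reconstruction of these experiments): the recognition signal for a novel view is an interpolation over stored familiar views, so it naturally weakens as a view moves away from the familiar ones.

```python
import numpy as np

def view_interpolation_response(stored_views, novel_view, sigma=0.5):
    """RBF-style recognition signal: one Gaussian unit per stored (familiar)
    view; the response to a novel view is the summed activation, which
    decays with distance from the familiar views."""
    d2 = np.sum((stored_views - novel_view) ** 2, axis=1)
    return float(np.exp(-d2 / (2 * sigma ** 2)).sum())

# Placeholder object: feature vectors for a few familiar viewpoints, plus
# novel views generated by perturbing one of them by a small or large amount.
rng = np.random.default_rng(1)
stored = rng.normal(size=(4, 16))               # 4 familiar views, 16-d features
near = stored[0] + 0.1 * rng.normal(size=16)    # close to a familiar view
far = stored[0] + 1.5 * rng.normal(size=16)     # far from all familiar views

print("near a familiar view:", round(view_interpolation_response(stored, near), 4))
print("far from familiar views:", round(view_interpolation_response(stored, far), 4))
```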
Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models
Pre-trained and frozen large language models (LLMs) can effectively map
simple scene rearrangement instructions to programs over a robot's visuomotor
functions through appropriate few-shot example prompting. To parse open-domain
natural language and adapt to a user's idiosyncratic procedures, not known
during prompt engineering time, fixed prompts fall short. In this paper, we
introduce HELPER, an embodied agent equipped with an external memory of
language-program pairs that parses free-form human-robot dialogue into action
programs through retrieval-augmented LLM prompting: relevant memories are
retrieved based on the current dialogue, instruction, correction, or VLM
description, and used as in-context prompt examples for LLM querying. The
memory is expanded during deployment to include pairs of user's language and
action plans, to assist future inferences and personalize them to the user's
language and routines. HELPER sets a new state-of-the-art in the TEACh
benchmark in both Execution from Dialog History (EDH) and Trajectory from
Dialogue (TfD), with a 1.7x improvement over the previous state-of-the-art for
TfD. Our models, code, and video results can be found on our project website:
https://helper-agent-llm.github.io.
Comment: Project page with code & videos: https://helper-agent-llm.github.io
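The retrieval-augmented prompting loop described above follows a familiar pattern; the sketch below is schematic and uses placeholder interfaces (embed_fn and llm_complete are assumptions, not HELPER's actual API), showing how language-program pairs might be retrieved as in-context examples and how the memory could grow during deployment.

```python
import numpy as np

class LanguageProgramMemory:
    """Key-value store mapping embedded user language to action programs."""

    def __init__(self, embed_fn):
        self.embed = embed_fn              # text -> 1-D numpy vector (placeholder)
        self.keys, self.values = [], []

    def add(self, language, program):
        self.keys.append(self.embed(language))
        self.values.append((language, program))

    def retrieve(self, query, k=3):
        q = self.embed(query)
        sims = [float(q @ key) / (np.linalg.norm(q) * np.linalg.norm(key) + 1e-8)
                for key in self.keys]
        top = np.argsort(sims)[::-1][:k]
        return [self.values[i] for i in top]

def plan(dialogue, memory, llm_complete):
    """Retrieve relevant language-program pairs and use them as in-context
    examples when asking the LLM (llm_complete is a placeholder) for a program."""
    examples = memory.retrieve(dialogue)
    prompt = "".join(f"Instruction: {lang}\nProgram: {prog}\n\n"
                     for lang, prog in examples)
    prompt += f"Instruction: {dialogue}\nProgram:"
    program = llm_complete(prompt)
    # Personalization: store the new pair so future queries can reuse it.
    memory.add(dialogue, program)
    return program
```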
Micro-Valences: Perceiving Affective Valence in Everyday Objects
Perceiving the affective valence of objects influences how we think about and react to the world around us. Conversely, the speed and quality with which we visually recognize objects in a visual scene can vary dramatically depending on that scene’s affective content. Although typical visual scenes contain mostly “everyday” objects, affect perception in visual objects has been studied using somewhat atypical stimuli with strong affective valences (e.g., guns or roses). Here we explore whether affective valence must be strong or overt to exert an effect on our visual perception. We conclude that everyday objects carry subtle affective valences – “micro-valences” – which are intrinsic to their perceptual representation.
HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models
Recent research on instructable agents has used memory-augmented Large
Language Models (LLMs) as task planners, a technique that retrieves
language-program examples relevant to the input instruction and uses them as
in-context examples in the LLM prompt to improve the performance of the LLM in
inferring the correct action and task plans. In this technical report, we
extend the capabilities of HELPER, by expanding its memory with a wider array
of examples and prompts, and by integrating additional APIs for asking
questions. This simple expansion of HELPER into a shared memory enables the
agent to work across the domains of executing plans from dialogue, natural
language instruction following, active question asking, and commonsense room
reorganization. We evaluate the agent on four diverse interactive
visual-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the
Tidy Task. HELPER-X achieves few-shot, state-of-the-art performance across
these benchmarks using a single agent, without requiring in-domain training,
and remains competitive with agents that have undergone in-domain training.
Comment: Videos and code: https://helper-agent-llm.github.io
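A shared memory of this kind can be pictured as a single retrieval index seeded with examples from all four benchmarks. The sketch below is illustrative only: the benchmark names are real, but the seed instructions, programs, and embed_fn are invented placeholders, not HELPER-X's actual prompts.

```python
import numpy as np

def build_shared_memory(embed_fn):
    """One memory shared across domains: each entry is (embedding, language,
    program). Benchmark names are real; programs are invented placeholders."""
    seed_examples = [
        ("TEACh",     "make a cup of coffee",           "goto(mug); pickup(mug); ..."),
        ("ALFRED",    "put a clean plate on the table", "goto(plate); clean(plate); ..."),
        ("DialFRED",  "find the knife",                 "ask_question('which drawer is the knife in?'); ..."),
        ("Tidy Task", "tidy up the living room",        "scan(room); restore(objects); ..."),
    ]
    memory = []
    for domain, language, program in seed_examples:
        memory.append((embed_fn(f"[{domain}] {language}"), language, program))
    return memory

def retrieve(memory, query, embed_fn, k=2):
    """Cosine-similarity retrieval over the shared memory; a single retriever
    serves all four benchmarks without any in-domain training."""
    q = embed_fn(query)
    q = q / (np.linalg.norm(q) + 1e-8)
    scored = [(float(q @ e) / (np.linalg.norm(e) + 1e-8), lang, prog)
              for e, lang, prog in memory]
    return sorted(scored, reverse=True)[:k]
```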
Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models
A long standing goal in neuroscience has been to elucidate the functional
organization of the brain. Within higher visual cortex, functional accounts
have remained relatively coarse, focusing on regions of interest (ROIs) and
taking the form of selectivity for broad categories such as faces, places,
bodies, food, or words. Because the identification of such ROIs has typically
relied on manually assembled stimulus sets consisting of isolated objects in
non-ecological contexts, exploring functional organization without robust a
priori hypotheses has been challenging. To overcome these limitations, we
introduce a data-driven approach in which we synthesize images predicted to
activate a given brain region using paired natural images and fMRI recordings,
bypassing the need for category-specific stimuli. Our approach -- Brain
Diffusion for Visual Exploration ("BrainDiVE") -- builds on recent generative
methods by combining large-scale diffusion models with brain-guided image
synthesis. Validating our method, we demonstrate the ability to synthesize
preferred images with appropriate semantic specificity for well-characterized
category-selective ROIs. We then show that BrainDiVE can characterize
differences between ROIs selective for the same high-level category. Finally we
identify novel functional subdivisions within these ROIs, validated with
behavioral data. These results advance our understanding of the fine-grained
functional organization of human visual cortex, and provide well-specified
constraints for further examination of cortical organization using
hypothesis-driven methods.
Comment: NeurIPS 2023 (Oral). Project page: https://www.cs.cmu.edu/~afluo/BrainDiVE
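Brain-guided image synthesis of this kind typically couples an image-to-brain encoder with guidance applied during diffusion sampling. The following is a schematic sketch of that general idea rather than BrainDiVE's implementation: denoiser, encoder, and the simplified update rule are placeholders, and the guidance step merely nudges each intermediate image toward higher predicted activation of a chosen ROI.

```python
import torch

def brain_guided_sampling(denoiser, encoder, roi_index, steps=50,
                          guidance_scale=5.0, shape=(1, 3, 64, 64)):
    """Schematic diffusion sampling with brain-encoder guidance.
    denoiser(x, t) -> predicted noise; encoder(x) -> predicted voxel
    activations. Both are placeholder modules, not the paper's models."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        with torch.enable_grad():
            x = x.detach().requires_grad_(True)
            eps = denoiser(x, t)                          # denoising prediction
            activation = encoder(x)[:, roi_index].sum()   # predicted ROI response
            grad = torch.autograd.grad(activation, x)[0]
        # Nudge toward images the encoder predicts will drive the ROI,
        # then apply a (heavily simplified) reverse-diffusion update.
        x = x + guidance_scale * grad
        x = x - eps / steps
    return x.detach()
```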
BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity
Understanding the functional organization of higher visual cortex is a
central focus in neuroscience. Past studies have primarily mapped the visual
and semantic selectivity of neural populations using hand-selected stimuli,
which may potentially bias results towards pre-existing hypotheses of visual
cortex functionality. Moving beyond conventional approaches, we introduce a
data-driven method that generates natural language descriptions for images
predicted to maximally activate individual voxels of interest. Our method --
Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds upon the
rich embedding space learned by a contrastive vision-language model and
utilizes a pre-trained large language model to generate interpretable captions.
We validate our method through fine-grained voxel-level captioning across
higher-order visual regions. We further perform text-conditioned image
synthesis with the captions, and show that our images are semantically coherent
and yield high predicted activations. Finally, to demonstrate how our method
enables scientific discovery, we perform exploratory investigations on the
distribution of "person" representations in the brain, and discover
fine-grained semantic selectivity in body-selective areas. Unlike earlier
studies that decode text, our method derives voxel-wise captions of semantic
selectivity. Our results show that BrainSCUBA is a promising means for
understanding functional preferences in the brain, and provides motivation for
further hypothesis-driven investigation of visual cortex.
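The voxel-wise captioning idea can be pictured as projecting a voxel's encoding weights into a contrastive vision-language embedding space and decoding the result into text. The sketch below is an illustration under assumed components: the CLIP-like image embeddings, captions, and voxel weights are random placeholders, and a real system would pass the selected exemplars through a pre-trained language model rather than returning them directly.

```python
import numpy as np

def voxel_caption(voxel_weight, image_embeddings, image_captions, top_k=5):
    """Approximate a voxel's preferred semantics: find the natural-image
    embeddings most aligned with the voxel's encoding weights, then hand
    those exemplars to a caption decoder (omitted placeholder here)."""
    w = voxel_weight / np.linalg.norm(voxel_weight)
    e = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    sims = e @ w
    top = np.argsort(sims)[::-1][:top_k]
    # A real pipeline would feed the pooled embedding (or these exemplar
    # captions) to a pre-trained language model to produce a fluent caption.
    return [image_captions[i] for i in top]

# Placeholder inputs: 1000 natural images embedded in a 512-d CLIP-like
# space, one caption per image, and one voxel's 512-d encoding weights.
rng = np.random.default_rng(2)
image_embeddings = rng.normal(size=(1000, 512))
image_captions = [f"caption_{i}" for i in range(1000)]
voxel_weight = rng.normal(size=512)
print(voxel_caption(voxel_weight, image_embeddings, image_captions))
```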