Faces as a "Model Category" for Visual Object Recognition
Visual recognition is an important ability that is central to many everyday tasks such as reading, navigation and social interaction, and is therefore actively studied in neuroscience, cognitive psychology and artificial intelligence. There exist thousands of object categories, all of which pose similar challenges to biological and artificial visual systems: accurate recognition under varying location, scale, view angle, illumination and clutter. In many areas of science, important discoveries have been made using "model organisms" such as fruit flies, mice and macaques. For the thousands of object categories, the important and well-studied category of faces could potentially serve as a "model category" upon which efforts are focused, and from which fundamental insights are drawn. However, it has been hotly debated whether faces are processed by the brain in a manner fundamentally different from other categories. Here we show that "neural tuning size" -- a single parameter in a computational model of object processing -- is able to account for important face-specific phenomena. Thus, surprisingly, "face-like" processing is explainable by physiological mechanisms that differ only quantitatively from "object-like" processing. Our computational proof-of-principle provides specific neural tuning properties that correspond to the so-far qualitative and controversial notion of "holistic" face processing. Overall, faces may be a viable model category. Since faces are highly amenable to complementary experimental techniques like functional MRI, electrophysiology, electroencephalography and transcranial magnetic stimulation, this further raises the odds that the algorithms and neural circuits underlying visual recognition may first be solved for faces. With faces serving as a model category, the great scientific challenge of understanding and reverse-engineering general visual recognition can be greatly accelerated.
Neural tuning size is a key factor underlying holistic face processing
Faces are a class of visual stimuli with unique significance, for a variety
of reasons. They are ubiquitous throughout the course of a person's life, and
face recognition is crucial for daily social interaction. Faces are also unlike
any other stimulus class in terms of certain physical stimulus characteristics.
Furthermore, faces have been empirically found to elicit certain characteristic
behavioral phenomena, which are widely held to be evidence of "holistic"
processing of faces. However, little is known about the neural mechanisms
underlying such holistic face processing. In other words, for the processing of
faces by the primate visual system, the input and output characteristics are
relatively well known, but the internal neural computations are not. The main
aim of this work is to further the fundamental understanding of what causes the
visual processing of faces to be different from that of objects. In this
computational modeling work, we show that a single factor - "neural tuning
size" - is able to account for three key phenomena that are characteristic of
face processing, namely the Composite Face Effect (CFE), Face Inversion Effect
(FIE) and Whole-Part Effect (WPE). Our computational proof-of-principle
provides specific neural tuning properties that correspond to the
poorly-understood notion of holistic face processing, and connects these neural
properties to psychophysical behavior. Overall, our work provides a unified and
parsimonious theoretical account for the disparate empirical data on
face-specific processing, deepening the fundamental understanding of face
processing.
Neural Tuning Size in a Model of Primate Visual Processing Accounts for Three Key Markers of Holistic Face Processing
Faces are an important and unique class of visual stimuli, and have been of interest to neuroscientists for many years. Faces are known to elicit certain characteristic behavioral markers, collectively labeled “holistic processing”, while non-face objects are not processed holistically. However, little is known about the underlying neural mechanisms. The main aim of this computational simulation work is to investigate the neural mechanisms that make face processing holistic. Using a model of primate visual processing, we show that a single key factor, “neural tuning size”, is able to account for three important markers of holistic face processing: the Composite Face Effect (CFE), Face Inversion Effect (FIE) and Whole-Part Effect (WPE). Our proof-of-principle specifies the precise neurophysiological property that corresponds to the poorly-understood notion of holism, and shows that this one neural property controls three classic behavioral markers of holism. Our work is consistent with neurophysiological evidence, and makes further testable predictions. Overall, we provide a parsimonious account of holistic face processing, connecting computation, behavior and neurophysiology.
National Science Foundation (U.S.) (STC award CCF-1231216)
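The tuning-size idea in the abstracts above can be illustrated numerically. The following is a minimal, hypothetical sketch (not the authors' actual model or parameters): a template neuron with Gaussian tuning, where a template spanning the whole face (a large tuning size) makes the response to one face half depend on the other half, while a part-sized template does not, mimicking composite-like interference.

```python
import numpy as np

# Hypothetical sketch: a face is a feature vector of two halves, and a
# "template neuron" responds with a Gaussian tuning function centered
# on a stored face template.
def response(stimulus, template, sigma):
    # Gaussian tuning: distance from template, scaled by tuning width sigma
    d2 = np.sum((stimulus - template) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

template = np.array([1.0, 1.0, 0.0, 0.0])   # [top half | bottom half]
same_top = np.array([1.0, 1.0, 0.9, 0.8])   # identical top, different bottom

# One large template spanning the whole face (holistic-like processing):
r_whole = response(same_top, template, sigma=1.0)

# A small template covering only the top half (part-based control):
r_top = response(same_top[:2], template[:2], sigma=1.0)

# With whole-face tuning, the changed bottom half alters the response even
# though the top half is identical; the part-sized template is unaffected.
print(r_whole < r_top)  # True
```

The design point: the only difference between the "holistic" and "part-based" conditions here is the spatial extent of the template, which is the kind of single-parameter account the abstracts describe.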
Advancing Perception in Artificial Intelligence through Principles of Cognitive Science
Although artificial intelligence (AI) has achieved many feats at a rapid
pace, there still exist open problems and fundamental shortcomings related to
performance and resource efficiency. Since AI researchers benchmark a
significant proportion of performance standards through human intelligence,
cognitive sciences-inspired AI is a promising domain of research. Studying
cognitive science can provide a fresh perspective to building fundamental
blocks in AI research, which can lead to improved performance and efficiency.
In this review paper, we focus on the cognitive functions of perception, which
is the process of taking signals from one's surroundings as input, and
processing them to understand the environment. Particularly, we study and
compare its various processes through the lens of both cognitive sciences and
AI. Through this study, we review all current major theories from various
sub-disciplines of cognitive science (specifically neuroscience, psychology and
linguistics), and draw parallels with theories and techniques from current
practices in AI. We hence present a detailed collection of methods in AI for
researchers to build AI systems inspired by cognitive science. Further, through
the process of reviewing the state of cognitive-inspired AI, we point out many
gaps in the current state of AI (with respect to the performance of the human
brain), and present potential directions for researchers to develop better
perception systems in AI.
Comment: Summary: a detailed review of the current state of perception models through the lens of cognitive AI
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Understanding relations between objects is crucial for understanding the
semantics of a visual scene. It is also an essential step in order to bridge
visual and language models. However, current state-of-the-art computer vision
models still lack the ability to perform spatial reasoning well. Existing
datasets mostly cover a relatively small number of spatial relations, all of
which are static relations that do not intrinsically involve motion. In this
paper, we propose the Spatial and Temporal Understanding of Prepositions
Dataset (STUPD) -- a large-scale video dataset for understanding static and
dynamic spatial relationships derived from prepositions of the English
language. The dataset contains 150K visual depictions (videos and images),
consisting of 30 distinct spatial prepositional senses, in the form of object
interaction simulations generated synthetically using Unity3D. In addition to
spatial relations, we also propose 50K visual depictions across 10 temporal
relations, consisting of videos depicting event/time-point interactions. To our
knowledge, no dataset exists that represents temporal relations through visual
settings. In this dataset, we also provide 3D information about object
interactions such as frame-wise coordinates, and descriptions of the objects
used. The goal of this synthetic dataset is to help models perform better in
visual relationship detection in real-world settings. We demonstrate an
increase in the performance of various models over 2 real-world datasets
(ImageNet-VidVRD and Spatial Senses) when pretrained on the STUPD dataset, in
comparison to other pretraining datasets.Comment: Submitted to Neurips Dataset track. 24 pages including citations and
appendi
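As a concrete illustration of what one record in such a dataset might contain, here is an entirely hypothetical sketch; the field names are assumptions based only on the abstract (preposition senses, videos/images, per-frame 3D coordinates, object descriptions), not the actual STUPD schema.

```python
from dataclasses import dataclass, field

# Hypothetical record structure for a synthetic spatial-relation sample.
# All field names are illustrative assumptions, not the dataset's API.
@dataclass
class STUPDSample:
    preposition_sense: str                 # e.g. "across", one of 30 spatial senses
    media_path: str                        # path to the rendered video or image
    is_dynamic: bool                       # dynamic relations intrinsically involve motion
    frame_coords: list = field(default_factory=list)   # per-frame 3D object coordinates
    object_descriptions: list = field(default_factory=list)

sample = STUPDSample(
    preposition_sense="across",
    media_path="videos/across_0001.mp4",
    is_dynamic=True,
    frame_coords=[[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]],
    object_descriptions=["ball", "table"],
)
print(sample.preposition_sense)  # across
```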
Abductive Action Inference
Abductive reasoning aims to make the most likely inference for a given set of
incomplete observations. In this work, we propose a new task called abductive
action inference, in which given a situation, the model answers the question
`what actions were executed by the human in order to arrive in the current
state?'. Given a state, we investigate three abductive inference problems:
action set prediction, action sequence prediction, and abductive action
verification. We benchmark several state-of-the-art models, including
Transformers, graph neural networks, CLIP, BLIP, end-to-end trained SlowFast,
and ResNet50-3D models. Our newly proposed object-relational BiGED model outperforms all other
methods on this challenging task on the Action Genome dataset. Code will be made available.
Comment: 16 pages, 9 figures
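The three abductive inference problems named in the abstract can be sketched as task signatures. The metric choices below (Jaccard overlap, exact sequence match) are illustrative assumptions, not the paper's evaluation protocol.

```python
# Hypothetical sketch of the three abductive action inference tasks.
def action_set_score(predicted: set, gold: set) -> float:
    # Action set prediction: which actions occurred, order-free.
    # Jaccard overlap is used here purely as an example metric.
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

def action_sequence_match(predicted: list, gold: list) -> bool:
    # Action sequence prediction: the order of actions matters.
    return predicted == gold

def verify_action(action: str, plausible_actions: set) -> bool:
    # Abductive action verification: is this single candidate action
    # consistent with the observed state?
    return action in plausible_actions

print(action_set_score({"open", "pour"}, {"pour", "open", "stir"}))  # 2/3
```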
Towards a unified account of face (and maybe object) processing
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 191-197).
Faces are an important class of visual stimuli, and are thought to be processed differently from objects by the human visual system. Going beyond the false dichotomy of same versus different processing, it is more important to understand how exactly faces are processed similarly to or differently from objects. However, even by itself, face processing is poorly understood. Various aspects of face processing, such as holistic, configural, and face-space processing, are investigated in relative isolation, and the relationships between these are unclear. Furthermore, face processing is characteristically affected by various stimulus transformations such as inversion, contrast reversal and spatial frequency filtering, but how or why is unclear. Most importantly, we do not understand even the basic mechanisms of face processing. We hypothesize that what makes face processing distinctive is the existence of large, coarse face templates. We test our hypothesis by modifying an existing model of object processing to utilize such templates, and find that our model can account for many face-related phenomena. Using small, fine face templates as a control, we find that our model displays object-like processing characteristics instead. Overall, we believe that we may have made the first steps towards achieving a unified account of face processing. In addition, results from our control suggest that face and object processing share fundamental computational mechanisms.
Coupled with recent advances in brain recording techniques, our results mean that face recognition could form the "tip of the spear" for attacking and solving the problem of visual recognition.
by Cheston Y.-C. Tan. Ph.D.