3 research outputs found
Advancing Perception in Artificial Intelligence through Principles of Cognitive Science
Although artificial intelligence (AI) has achieved many feats at a rapid
pace, open problems and fundamental shortcomings related to performance and
resource efficiency remain. Since AI researchers benchmark a significant
proportion of performance standards against human intelligence,
cognitive-science-inspired AI is a promising domain of research. Studying
cognitive science can provide a fresh perspective on the fundamental building
blocks of AI research, which can lead to improved performance and efficiency.
In this review paper, we focus on the cognitive function of perception, the
process of taking signals from one's surroundings as input and processing
them to understand the environment. In particular, we study and compare its
various processes through the lens of both cognitive science and AI. Through
this study, we review all current major theories from various sub-disciplines
of cognitive science (specifically neuroscience, psychology, and linguistics)
and draw parallels with theories and techniques from current practice in AI.
We hence present a detailed collection of AI methods for researchers to build
AI systems inspired by cognitive science. Further, in reviewing the state of
cognitive-inspired AI, we point out many gaps in the current state of AI
(with respect to the performance of the human brain) and present potential
directions for researchers to develop better perception systems in AI.
Comment: Summary: a detailed review of the current state of perception models through the lens of cognitive A
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Understanding relations between objects is crucial for understanding the
semantics of a visual scene. It is also an essential step toward bridging
visual and language models. However, current state-of-the-art computer vision
models still lack the ability to perform spatial reasoning well. Existing
datasets mostly cover a relatively small number of spatial relations, all of
which are static and do not intrinsically involve motion. In this
paper, we propose the Spatial and Temporal Understanding of Prepositions
Dataset (STUPD) -- a large-scale video dataset for understanding static and
dynamic spatial relationships derived from prepositions of the English
language. The dataset contains 150K visual depictions (videos and images),
consisting of 30 distinct spatial prepositional senses, in the form of object
interaction simulations generated synthetically using Unity3D. In addition to
spatial relations, we also propose 50K visual depictions across 10 temporal
relations, consisting of videos depicting event/time-point interactions. To our
knowledge, no dataset exists that represents temporal relations through visual
settings. In this dataset, we also provide 3D information about object
interactions, such as frame-wise coordinates, along with descriptions of the
objects used. The goal of this synthetic dataset is to help models perform better at
visual relationship detection in real-world settings. We demonstrate an
increase in the performance of various models on two real-world datasets
(ImageNet-VidVRD and Spatial Senses) when pretrained on the STUPD dataset, in
comparison to other pretraining datasets.
Comment: Submitted to NeurIPS Dataset track. 24 pages including citations and appendix
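The abstract above notes that each STUPD sample ships with frame-wise 3D coordinates for the interacting objects. As a minimal sketch of how such a record could be represented and queried, the Python below defines a hypothetical sample type and a simple motion feature; all names (`STUPDSample`, `coords_3d`, `displacement`) are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record layout for one STUPD sample; field names are
# illustrative, not the dataset's published schema.
@dataclass
class STUPDSample:
    relation: str                  # e.g. one of the 30 spatial prepositional senses
    is_dynamic: bool               # static relation (image) vs. dynamic (video)
    frame_paths: List[str]         # rendered Unity3D frames
    object_names: Tuple[str, str]  # subject and object of the preposition
    # frame-wise 3D coordinates: coords_3d[t][i] = (x, y, z) of object i at frame t
    coords_3d: List[List[Tuple[float, float, float]]] = field(default_factory=list)

def displacement(sample: STUPDSample, obj_index: int) -> float:
    """Total 3D path length of one object across the clip -- a simple
    feature that separates static from dynamic relations."""
    total = 0.0
    for prev, curr in zip(sample.coords_3d, sample.coords_3d[1:]):
        dx, dy, dz = (c - p for p, c in zip(prev[obj_index], curr[obj_index]))
        total += (dx * dx + dy * dy + dz * dz) ** 0.5
    return total
```

A static relation would yield near-zero displacement for both objects, while a dynamic one (e.g. motion "into" a container) would not.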
WTM: Weighted Temporal Attention Module for Group Activity Recognition
Group Activity Recognition requires spatiotemporal modeling of an exponential
number of semantic and geometric relations among the individuals in a scene.
Previous attempts model these relations by aggregating independently derived
spatial and temporal features, which increases modeling complexity and yields
sparse information due to the lack of correlation between features. In this
paper, we propose the Weighted Temporal Attention Module (WTM), a
representational mechanism that combines the spatial and temporal features of
a local subset of a visual sequence into a single 2D image representation,
highlighting the areas of a frame where actor motion is significant. Pairwise
dense optical flow maps, representing the temporal characteristics of
individuals over a sequence, are applied as attention masks to raw RGB images
through a multi-layer weighted aggregation. We demonstrate a strong
correlation between spatial and temporal features, which helps localize
actions effectively in multi-person scenarios. The simplicity of the input
representation allows the model to be trained by 2D image classification
architectures in a plug-and-play fashion, outperforming its multi-stream and
multi-dimensional counterparts. The proposed method achieves the lowest
computational complexity in comparison to other works. We demonstrate the
performance of WTM on two widely used public benchmark datasets, the
Collective Activity Dataset (CAD) and the Volleyball Dataset, achieving
state-of-the-art accuracies of 95.1% and 94.6%, respectively. We also discuss
the application of this method to other datasets and general scenarios. The
code is being made publicly available.
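To make the masking step in the abstract above concrete (a motion map applied as per-pixel attention over RGB frames, aggregated into one 2D image), here is a minimal NumPy sketch. It is not the authors' implementation: it substitutes absolute frame differences for the paper's pairwise dense optical flow, and the function names and uniform window weights are assumptions for illustration only.

```python
import numpy as np

def temporal_attention_map(frames: np.ndarray) -> np.ndarray:
    """Per-pixel motion saliency over a short clip, normalized to [0, 1].

    Stand-in for the paper's pairwise dense optical flow: absolute
    frame differences serve as a crude motion proxy here.
    frames: (T, H, W, 3) float array with values in [0, 1].
    """
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=-1)  # (T-1, H, W)
    sal = diffs.mean(axis=0)                               # (H, W)
    rng = sal.max() - sal.min()
    return (sal - sal.min()) / rng if rng > 0 else np.zeros_like(sal)

def wtm_representation(frames: np.ndarray, window_weights=None) -> np.ndarray:
    """Collapse a clip into a single 2D RGB image, weighting each pixel
    by its motion saliency so that moving actors are highlighted."""
    T = frames.shape[0]
    w = np.ones(T) / T if window_weights is None else np.asarray(window_weights, float)
    w = w / w.sum()
    agg = np.tensordot(w, frames, axes=(0, 0))      # (H, W, 3) weighted frame average
    attn = temporal_attention_map(frames)[..., None]
    return agg * attn                                # suppress static regions
```

The resulting (H, W, 3) image can then be fed to any off-the-shelf 2D image classifier, which is the plug-and-play property the abstract describes.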