Deep neural networks are increasingly applied to tasks that demand superior
cognitive abilities, e.g., playing Go, generating art, and powering ChatGPT.
This dramatic progress raises the question: how generalizable are neural networks in solving
problems that demand broad skills? To answer this question, we propose SMART: a
Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101
dataset, for evaluating the abstraction, deduction, and generalization
abilities of neural networks in solving visuo-linguistic puzzles designed
specifically for children in the 6--8 age group. Our dataset consists of 101
unique puzzles; each puzzle comprises a picture and a question, and solving it
requires a mix of elementary skills, including arithmetic, algebra, and
spatial reasoning. To scale our dataset for
training deep neural networks, we programmatically generate entirely new
instances of each puzzle while retaining its solution algorithm. To
benchmark performance on SMART-101, we propose a vision-and-language
meta-learning model using varied state-of-the-art backbones. Our experiments
reveal that while powerful deep models offer reasonable performance on puzzles
in a supervised setting, they perform no better than chance when evaluated
for generalization. We also evaluate ChatGPT and other recent large language
models on a subset of SMART-101 and find that, while these models show
convincing reasoning abilities, their answers are often incorrect.

Comment: Accepted at CVPR 2023. For the SMART-101 dataset, see
https://doi.org/10.5281/zenodo.776179