Intuitive physics is pivotal for human understanding of the physical world,
enabling prediction and interpretation of events even in infancy. Nonetheless,
replicating this level of intuitive physics in artificial intelligence (AI)
remains a formidable challenge. This study introduces X-VoE, a comprehensive
benchmark dataset, to assess AI agents' grasp of intuitive physics. Built on
the Violation of Expectation (VoE) paradigm from developmental psychology,
X-VoE establishes a higher bar for the explanatory capacities of intuitive
physics models. Each VoE scenario within X-VoE encompasses three distinct
settings, probing models' comprehension of events and their underlying
explanations. Beyond model evaluation, we present an explanation-based learning
system that captures physics dynamics and infers occluded object states solely
from visual sequences, without explicit occlusion labels. Experiments show
that our model aligns with human commonsense when tested on
X-VoE. Notably, our model can visually explain VoE
events by reconstructing concealed scenes. Finally, we discuss the implications
of these findings and outline future research directions. Through X-VoE, we aim
to catalyze the advancement of AI endowed with human-like intuitive physics capabilities.
Comment: 19 pages, 16 figures, selected for an Oral presentation at ICCV 2023.
Project link: https://pku.ai/publication/intuitive2023iccv