Humans and animals have a rich and flexible understanding of the physical
world, which enables them to infer the underlying dynamical trajectories of
objects and events and their plausible future states, and to use these
inferences to plan and anticipate the consequences of actions. However, the
neural mechanisms underlying these computations remain unclear. We combine a
goal-driven modeling approach with dense neurophysiological data and
high-throughput human behavioral readouts to directly address this question.
Specifically, we
construct and evaluate several classes of sensory-cognitive networks to predict
the future state of rich, ethologically relevant environments, ranging from
self-supervised end-to-end models with pixel-wise or object-centric objectives,
to models that future predict in the latent space of purely static image-based
or dynamic video-based pretrained foundation models. We find strong
differentiation across these model classes in their ability to predict neural
and behavioral data both within and across diverse environments. In particular,
we find that neural responses are currently best predicted by models trained to
predict the future state of their environment in the latent space of pretrained
foundation models optimized for dynamic scenes in a self-supervised manner.
Notably, models that future predict in the latent space of video foundation
models that are optimized to support a diverse range of sensorimotor tasks,
reasonably match both human behavioral error patterns and neural dynamics
across all environmental scenarios that we were able to test. Overall, these
findings suggest that the neural mechanisms and behaviors of primate mental
simulation are thus far most consistent with being optimized to future predict
on dynamic, reusable visual representations that are useful for embodied AI
more generally.

Comment: 17 pages, 6 figures