Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions
We propose a novel framework for the task of object-centric video prediction,
i.e., extracting the compositional structure of a video sequence, as well as
modeling object dynamics and interactions from visual observations in order to
predict the future object states, from which we can then generate subsequent
video frames. With the goal of learning meaningful spatio-temporal object
representations and accurately forecasting object states, we propose two novel
object-centric video predictor (OCVP) transformer modules, which decouple the
processing of temporal dynamics and object interactions, thus achieving improved
prediction performance. In our experiments, we show that our
object-centric prediction framework utilizing our OCVP predictors outperforms
object-agnostic video prediction models on two different datasets, while
maintaining consistent and accurate object representations.
Comment: Accepted for publication at IEEE International Conference on Image Processing (ICIP) 202
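The decoupling described in this abstract can be illustrated with a short sketch. The code below is not the authors' OCVP implementation; it is a minimal PyTorch reconstruction, with assumed slot dimensions and module names, of the idea of separating temporal attention (each object attending to its own past states) from relational attention (objects attending to each other within a frame).

```python
# Minimal sketch of a decoupled object-centric predictor: one attention block over
# time per object, one over objects per time step. Sizes and names are illustrative.
import torch
import torch.nn as nn

class DecoupledSlotPredictor(nn.Module):
    def __init__(self, slot_dim=64, num_heads=4):
        super().__init__()
        # temporal attention: each object token attends to its own past states
        self.temporal_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        # relational attention: objects within one time step attend to each other
        self.object_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(slot_dim, 4 * slot_dim), nn.GELU(),
                                 nn.Linear(4 * slot_dim, slot_dim))
        self.norm1 = nn.LayerNorm(slot_dim)
        self.norm2 = nn.LayerNorm(slot_dim)
        self.norm3 = nn.LayerNorm(slot_dim)

    def forward(self, slots):
        # slots: (batch, time, num_objects, slot_dim)
        b, t, n, d = slots.shape
        # temporal pass: fold objects into the batch and attend over time
        x = slots.permute(0, 2, 1, 3).reshape(b * n, t, d)
        q = self.norm1(x)
        x = x + self.temporal_attn(q, q, q)[0]
        x = x.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # relational pass: fold time into the batch and attend over objects
        y = x.reshape(b * t, n, d)
        r = self.norm2(y)
        y = y + self.object_attn(r, r, r)[0]
        y = y + self.mlp(self.norm3(y))
        return y.reshape(b, t, n, d)

# usage: predict the next set of object slots from a short context window
pred = DecoupledSlotPredictor()(torch.randn(2, 5, 6, 64))[:, -1]  # (2, 6, 64)
```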
Occlusion resistant learning of intuitive physics from videos
To reach human performance on complex tasks, a key ability for artificial
systems is to understand physical interactions between objects, and predict
future outcomes of a situation. This ability, often referred to as intuitive
physics, has recently received attention, and several methods have been proposed to
learn these physical rules from video sequences. Yet, most of these methods are
restricted to the case where no, or only limited, occlusions occur. In this
work we propose a probabilistic formulation of learning intuitive physics in 3D
scenes with significant inter-object occlusions. In our formulation, object
positions are modeled as latent variables enabling the reconstruction of the
scene. We then propose a series of approximations that make this problem
tractable. Object proposals are linked across frames using a combination of a
recurrent interaction network, modeling the physics in object space, and a
compositional renderer, modeling the way in which objects project onto pixel
space. We demonstrate significant improvements over the state of the art on the
IntPhys intuitive physics benchmark. We apply our method to a second dataset
with increasing levels of occlusion, showing that it realistically predicts
segmentation masks up to 30 frames into the future. Finally, we also show results
on predicting the motion of objects in real videos.
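As a rough illustration of the two components named in this abstract, the sketch below pairs a recurrent interaction network operating on latent object states with a very crude compositional renderer that projects those states back to pixel space. This is my reconstruction under placeholder dimensions, not the paper's probabilistic model, and the occlusion reasoning in the actual method is considerably more involved.

```python
# Sketch only: pairwise interaction network rolling latent object states forward,
# plus a toy renderer that composites per-object occupancy maps into an image.
import torch
import torch.nn as nn

class InteractionStep(nn.Module):
    def __init__(self, state_dim=32, hidden=128):
        super().__init__()
        self.rel = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, state_dim))   # pairwise effects
        self.upd = nn.GRUCell(state_dim, state_dim)              # recurrent per-object update

    def forward(self, states):
        # states: (batch, num_objects, state_dim) -> next latent object states
        b, n, d = states.shape
        pairs = torch.cat([states.unsqueeze(2).expand(b, n, n, d),
                           states.unsqueeze(1).expand(b, n, n, d)], dim=-1)
        effects = self.rel(pairs).sum(dim=2)                     # aggregate incoming effects
        return self.upd(effects.reshape(b * n, d), states.reshape(b * n, d)).reshape(b, n, d)

class CompositionalRenderer(nn.Module):
    def __init__(self, state_dim=32, image_size=64):
        super().__init__()
        self.decode = nn.Linear(state_dim, image_size * image_size)  # per-object logits
        self.image_size = image_size

    def forward(self, states):
        # decode each object to a soft mask, then composite by per-pixel maximum
        logits = self.decode(states).view(*states.shape[:2], self.image_size, self.image_size)
        return torch.sigmoid(logits).max(dim=1).values               # (batch, H, W)
```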
Learning Depth from Monocular Videos using Direct Methods
The ability to predict depth from a single image - using recent advances in
CNNs - is of increasing interest to the vision community. Unsupervised learning
strategies are particularly appealing, as they can exploit much larger and more
varied monocular video datasets during training without the need for
ground-truth depth or stereo. In previous works, separate pose and depth CNN
predictors had to be determined such that their joint outputs minimized the
photometric error. Inspired by recent advances in direct visual odometry (DVO),
we argue that the depth CNN predictor can be learned without a pose CNN
predictor. Further, we demonstrate empirically that incorporating a
differentiable implementation of DVO, along with a novel depth normalization
strategy, substantially improves performance over the state of the art that uses
monocular videos for training.
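The depth-normalization idea and the photometric objective described here can be sketched briefly. This is a minimal illustration under my own assumptions, not the paper's implementation: the differentiable DVO pose solver and the image warping are abstracted away behind a hypothetical warp_to_target callable.

```python
# Sketch: per-image mean depth normalization (removing the scale ambiguity that
# otherwise lets predicted depth shrink toward zero) plus an L1 photometric error.
import torch

def normalize_depth(depth, eps=1e-7):
    # depth: (batch, 1, H, W); divide each map by its own mean depth
    mean = depth.mean(dim=(1, 2, 3), keepdim=True)
    return depth / (mean + eps)

def photometric_loss(target, source, depth, pose, warp_to_target):
    # warp_to_target is assumed to differentiably reproject `source` into the
    # target view using the normalized depth and the pose from direct alignment
    warped = warp_to_target(source, normalize_depth(depth), pose)
    return (target - warped).abs().mean()
```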