Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
We study the problem of synthesizing a number of likely future frames from a
single input image. In contrast to traditional methods, which have tackled this
problem in a deterministic or non-parametric way, we propose a novel approach
that models future frames in a probabilistic manner. Our probabilistic model
makes it possible for us to sample and synthesize many possible future frames
from a single input image. Future frame synthesis is challenging, as it
involves low- and high-level image and motion understanding. We propose a novel
network structure, a Cross Convolutional Network, to aid in synthesizing
future frames; this network encodes image and motion information as
feature maps and convolutional kernels, respectively. In experiments, our model
performs well on synthetic data, such as 2D shapes and animated game sprites,
as well as on real-world videos. We also show that our model can be applied to
tasks such as visual analogy-making, and present an analysis of the learned
network representations.
Comment: The first two authors contributed equally to this work.
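The cross-convolution idea is concrete enough to sketch: the image pathway produces feature maps, the motion pathway turns a sampled latent code into convolutional kernels, and the two meet in a per-sample convolution. A minimal PyTorch sketch; the layer sizes, the grouped-convolution trick, and the names image_enc, kernel_dec, and frame_dec are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossConv(nn.Module):
    def __init__(self, z_dim=64, feat_ch=32, ksize=5):
        super().__init__()
        self.feat_ch, self.ksize = feat_ch, ksize
        # Image pathway: encode the input frame into feature maps.
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        # Motion pathway: decode a sampled latent motion code into one
        # convolutional kernel per feature channel.
        self.kernel_dec = nn.Linear(z_dim, feat_ch * ksize * ksize)
        # Map the cross-convolved features back to an output frame.
        self.frame_dec = nn.Conv2d(feat_ch, 3, 3, padding=1)

    def forward(self, image, z):
        b = image.size(0)
        feats = self.image_enc(image)                       # (B, C, H, W)
        kernels = self.kernel_dec(z).view(
            b * self.feat_ch, 1, self.ksize, self.ksize)
        # Cross convolution: each sample's feature maps are convolved with
        # that same sample's motion kernels, via a grouped convolution.
        feats = feats.reshape(1, b * self.feat_ch, *feats.shape[2:])
        out = F.conv2d(feats, kernels, padding=self.ksize // 2,
                       groups=b * self.feat_ch)
        return self.frame_dec(out.reshape(b, self.feat_ch, *image.shape[2:]))

Drawing different motion codes z ~ N(0, I) for the same input image then yields different synthesized futures, which is the probabilistic sampling behavior the abstract describes.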
Physical Primitive Decomposition
Objects are made of parts, each with distinct geometry, physics,
functionality, and affordances. Developing such a distributed, physical,
interpretable representation of objects will facilitate intelligent agents to
better explore and interact with the world. In this paper, we study physical
primitive decomposition---understanding an object through its components, each
with physical and geometric attributes. As annotated data for object parts and
physics are rare, we propose a novel formulation that learns physical
primitives by explaining both an object's appearance and its behaviors in
physical events. Our model performs well on block towers and tools in both
synthetic and real scenarios; we also demonstrate that visual and physical
observations often provide complementary signals. We further present ablation
and behavioral studies to better understand our model and contrast it with
human performance.
Comment: ECCV 2018. Project page: http://ppd.csail.mit.edu
Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks
We study the problem of synthesizing a number of likely future frames from a
single input image. In contrast to traditional methods that have tackled this
problem in a deterministic or non-parametric way, we propose to model future
frames in a probabilistic manner. Our probabilistic model makes it possible for
us to sample and synthesize many possible future frames from a single input
image. To synthesize realistic movement of objects, we propose a novel network
structure, namely a Cross Convolutional Network; this network encodes image and
motion information as feature maps and convolutional kernels, respectively. In
experiments, our model performs well on synthetic data, such as 2D shapes and
animated game sprites, and on real-world video frames. We present analyses of
the learned network representations, showing that the network implicitly
learns a compact encoding of object appearance and motion. We also
demonstrate a few of its applications, including visual analogy-making and
video extrapolation.
Comment: Journal preprint of arXiv:1607.02586 (IEEE TPAMI, 2019). The first
two authors contributed equally to this work. Project page:
http://visualdynamics.csail.mit.edu
Correctness of belief propagation in Gaussian graphical models of arbitrary topology
Local "belief propagation" rules of the sort proposed byPearl [12] are guaranteed to converge to the correct posterior probabilities in singly connected graphical models. Recently, a number of researchers have empirically demonstrated good performance of "loopy belief propagation" -- using these same rules on graphs with loops. Perhaps the most dramatic instance is the near Shannonlimit performance of "Turbo codes", whose decoding algorithm is equivalentto loopy belief propagation. Except for th
Properties and Applications of Shape Recipes
In low-level vision, the representations of scene properties such as shape and albedo are very high-dimensional, as they have to describe complicated structures. The approach proposed here is to let the image itself bear as much of the representational burden as possible. In many situations, scene and image are closely related, and it is possible to find a functional relationship between them. The scene information can then be represented in reference to the image, where the functional specifies how to translate the image into the associated scene. We illustrate the use of this representation for encoding shape information. We show that this representation has appealing properties, such as locality and slow variation across space and scale. These properties provide a way of improving shape estimates coming from other sources of information, such as stereo.
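A minimal sketch of the recipe idea, assuming the functional is a local linear regression fit wherever another cue (e.g. stereo) supplies a shape estimate; the patch representation and the least-squares form are assumptions, not the paper's exact formulation.

import numpy as np

def learn_recipe(image_patches, shape_patches):
    # Fit the "recipe": a linear map from local image structure to local
    # shape, trained where a shape estimate from another cue is available.
    X = image_patches.reshape(len(image_patches), -1)
    Y = shape_patches.reshape(len(shape_patches), -1)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def apply_recipe(W, image_patches):
    # Because recipes vary slowly across space and scale, the same W can
    # be reused at nearby locations to refine or fill in shape estimates.
    return image_patches.reshape(len(image_patches), -1) @ W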
Shape-Time Photography
We introduce a new method to describe, in a single image, changes in shape over time. We acquire both range and image information with a stationary stereo camera. From the pictures taken, we display a composite image consisting of the image data from the surface closest to the camera at every pixel. This reveals the 3-d relationships over time through easy-to-interpret occlusion relationships in the composite image. We call the composite a shape-time photograph. Small errors in depth measurements cause artifacts in the shape-time images. We correct most of these using a Markov network to estimate the most probable front surface, taking into account the depth measurements, their uncertainties, and layer continuity assumptions.
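The compositing rule itself is a one-liner. A sketch assuming registered frames with per-pixel range, and omitting the Markov-network cleanup of the depth maps that the abstract describes.

import numpy as np

def shape_time(images, depths):
    # images: (T, H, W, 3) frames from the stationary camera;
    # depths: (T, H, W) per-pixel range from the stereo pair.
    front = np.argmin(depths, axis=0)        # frontmost frame at each pixel
    return np.take_along_axis(images, front[None, ..., None], axis=0)[0]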
4D Frequency Analysis of Computational Cameras for Depth of Field Extension
Depth of field (DOF), the range of scene depths that appear sharp in a photograph, poses a fundamental tradeoff in photography---wide apertures are important to reduce imaging noise, but they also increase defocus blur. Recent advances in computational imaging modify the acquisition process to extend the DOF through deconvolution. Because deconvolution quality is a tight function of the frequency power spectrum of the defocus kernel, designs with high spectra are desirable. In this paper we study how to design effective extended-DOF systems, and show an upper bound on the maximal power spectrum that can be achieved. We analyze defocus kernels in the 4D light field space and show that in the frequency domain, only a low-dimensional 3D manifold contributes to focus. Thus, to maximize the defocus spectrum, imaging systems should concentrate their limited energy on this manifold. We review several computational imaging systems and show that they either spend energy outside the focal manifold or do not achieve a high spectrum over the DOF. Guided by this analysis we introduce the lattice-focal lens, which concentrates energy at the low-dimensional focal manifold and achieves a higher power spectrum than previous designs. We have built a prototype lattice-focal lens and present extended depth of field results.
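The link between kernel spectrum and deconvolution quality can be made explicit with a standard Wiener-filter identity (textbook material, not the paper's derivation). For an observation G = KF + N with defocus OTF K, signal prior power S, and noise power sigma^2:

\hat F(\omega) = \frac{K^*(\omega)\, G(\omega)}{|K(\omega)|^2 + \sigma^2 / S(\omega)},
\qquad
\mathbb{E}\!\left[\,|\hat F(\omega) - F(\omega)|^2\,\right]
  = \frac{\sigma^2\, S(\omega)}{S(\omega)\,|K(\omega)|^2 + \sigma^2}.

The per-frequency error falls monotonically as |K(omega)|^2 grows, which is why designs with high power spectra over the working depth range deconvolve more accurately.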
A Comparative Evaluation of Approximate Probabilistic Simulation and Deep Neural Networks as Accounts of Human Physical Scene Understanding
Humans demonstrate remarkable abilities to predict physical events in complex
scenes. Two classes of models for physical scene understanding have recently
been proposed: "Intuitive Physics Engines", or IPEs, which posit that people
make predictions by running approximate probabilistic simulations in causal
mental models similar in nature to video-game physics engines, and memory-based
models, which make judgments based on analogies to stored experiences of
previously encountered scenes and physical outcomes. Versions of the latter
have recently been instantiated in convolutional neural network (CNN)
architectures. Here we report four experiments that, to our knowledge, are the
first rigorous comparisons of simulation-based and CNN-based models, where both
approaches are concretely instantiated in algorithms that can run on raw image
inputs and produce as outputs physical judgments such as whether a stack of
blocks will fall. Both approaches can achieve super-human accuracy levels and
can quantitatively predict human judgments to a similar degree, but only the
simulation-based models generalize to novel situations in ways that people do,
and are qualitatively consistent with systematic perceptual illusions and
judgment asymmetries that people show.
Comment: Accepted to CogSci 2016 as an oral presentation.
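The IPE side of the comparison is easy to caricature in code: sample noisy percepts, run an approximate simulation forward, and report the fraction of samples in which the stack falls. A toy sketch with a 1-D stability check standing in for a game-engine simulator; the block geometry, noise level, and the names stack_falls and p_fall are all illustrative assumptions.

import random

def stack_falls(xs, half_width=0.5):
    # Deterministic stability check for a stack of unit-width blocks:
    # the center of mass of everything above a block must lie within
    # that block's footprint, for every block in the stack.
    for i in range(len(xs) - 1):
        com = sum(xs[i + 1:]) / len(xs[i + 1:])
        if abs(com - xs[i]) > half_width:
            return True
    return False

def p_fall(xs, n_samples=100, noise=0.1):
    # IPE-style judgment: jitter the perceived block positions, simulate
    # each sample, and aggregate outcomes into a graded probability.
    hits = sum(stack_falls([x + random.gauss(0, noise) for x in xs])
               for _ in range(n_samples))
    return hits / n_samples

# A visibly offset tower is judged likely to fall, a straight one is not:
# p_fall([0.0, 0.4, 0.8]) is high; p_fall([0.0, 0.0, 0.0]) is low.

The graded, noise-driven output is what lets simulation-based accounts reproduce the human judgment asymmetries the abstract mentions, where a purely deterministic check would return only 0 or 1.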