Existing methods to recognize actions in static images take the images at
their face value, learning the appearances---objects, scenes, and body
poses---that distinguish each action class. However, such models are deprived
of the rich dynamic structure and motions that also define human activity. We
propose an approach that hallucinates the unobserved future motion implied by a
single snapshot to help static-image action recognition. The key idea is to
learn a prior over short-term dynamics from thousands of unlabeled videos,
infer the anticipated optical flow on novel static images, and then train
discriminative models that exploit both streams of information. Our main
contributions are twofold. First, we devise an encoder-decoder convolutional
neural network and a novel optical flow encoding that can translate a static
image into an accurate flow map. Second, we show the power of hallucinated flow
for recognition, successfully transferring the learned motion into a standard
two-stream network for activity recognition. On seven datasets, we demonstrate
the power of the approach. It not only achieves state-of-the-art accuracy for
dense optical flow prediction, but also consistently enhances recognition of
actions and dynamic scenes.Comment: Published in CVPR 2018, project page:
http://vision.cs.utexas.edu/projects/im2flow