In this work, we propose a technique to convert CNN models for semantic
segmentation of static images into CNNs for video data. We describe a warping
method that can be used to augment existing architectures with very little
extra computational cost. This module is called NetWarp and we demonstrate its
use for a range of network architectures. The main design principle is to use
optical flow of adjacent frames for warping internal network representations
across time. A key insight of this work is that fast optical flow methods can
be combined with many different CNN architectures for improved performance and
end-to-end training. Experiments validate that the proposed approach incurs
only little extra computational cost, while improving performance, when video
streams are available. We achieve new state-of-the-art results on the CamVid
and Cityscapes benchmark datasets and show consistent improvements over
different baseline networks. Our code and models will be available at
http://segmentation.is.tue.mpg.deComment: ICCV 201