In video analysis, background models have many applications such as
background/foreground separation, change detection, anomaly detection,
tracking, and more. However, while learning such a model in a video captured by
a static camera is a largely solved task, in the case of a Moving-camera
Background Model (MCBM) the success has been far more modest, owing to
algorithmic and scalability challenges that arise from the camera motion.
Thus, existing MCBMs are limited in their scope and their supported
camera-motion types. These hurdles also impeded the employment, in this
unsupervised task, of end-to-end solutions based on deep learning (DL).
Moreover, existing MCBMs usually model the background either over the domain of a
typically large panoramic image or in an online fashion. Unfortunately, the
former creates several problems, including poor scalability, while the latter
prevents recognizing and leveraging cases where the camera revisits
previously seen parts of the scene. This paper proposes a new method, called
DeepMCBM, that eliminates all the aforementioned issues and achieves
state-of-the-art results. Concretely, we first identify the difficulties
associated with joint alignment of video frames in general, and in a DL setting
in particular. Next, we propose a new strategy for joint alignment that lets us
use a spatial transformer net with neither regularization nor any form of
specialized (and non-differentiable) initialization.
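To make the joint-alignment strategy concrete, here is a minimal PyTorch sketch of the idea under our own assumptions (per-frame affine transforms, L1 loss against a robust central image); it is an illustration, not the paper's implementation, and names such as AlignmentSTN and joint_alignment_loss are hypothetical.

```python
# Minimal sketch (our illustration, not the paper's code) of joint alignment
# with a spatial transformer net. A small CNN regresses a per-frame 2x3 affine
# transform, frames are warped into a shared coordinate frame, and an L1 loss
# against the per-pixel median serves as a robust alignment objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentSTN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(32, 6)  # 6 affine parameters per frame
        # Start at the identity transform; plain, differentiable initialization.
        nn.init.zeros_(self.fc.weight)
        self.fc.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, frames):  # frames: (N, 3, H, W), the video as a batch
        theta = self.fc(self.features(frames)).view(-1, 2, 3)
        grid = F.affine_grid(theta, frames.size(), align_corners=False)
        warped = F.grid_sample(frames, grid, align_corners=False)
        return warped, theta

def joint_alignment_loss(warped):
    # Per-pixel median over frames is a robust central image; L1 keeps the
    # alignment insensitive to moving foreground objects.
    target = warped.median(dim=0, keepdim=True).values
    return (warped - target).abs().mean()
```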
Coupled with an autoencoder conditioned on unwarped robust central moments
(obtained from the joint alignment), this yields an end-to-end,
regularization-free MCBM that supports a broad range of camera motions and
scales gracefully.
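As a second hedged sketch, again our own assumption-laden illustration rather than the paper's code, the unwarped robust central moments could be realized as follows: robust first and second moments (per-pixel median and median absolute deviation) over the aligned frames, warped back to each frame by the inverse of its estimated affine transform. The helper names are hypothetical.

```python
# Hypothetical sketch of "unwarped robust central moments": per-pixel median
# and median absolute deviation are computed over the jointly-aligned frames,
# then mapped back to each frame's coordinates by inverting its affine
# transform (theta as predicted by the alignment net above). An autoencoder
# could then be conditioned on these maps, e.g. as extra input channels.
import torch
import torch.nn.functional as F

def robust_central_moments(warped):  # warped: (N, 3, H, W)
    center = warped.median(dim=0).values                    # robust "mean"
    spread = (warped - center).abs().median(dim=0).values   # robust "std"
    return torch.stack([center, spread])                    # (2, 3, H, W)

def unwarp_to_frames(moments, theta):  # theta: (N, 2, 3)
    # Invert each frame's affine by augmenting it to a 3x3 matrix.
    n = theta.shape[0]
    bottom = torch.tensor([0., 0., 1.]).expand(n, 1, 3)
    inv = torch.linalg.inv(torch.cat([theta, bottom], dim=1))[:, :2, :]
    maps = moments.flatten(0, 1).unsqueeze(0).expand(n, -1, -1, -1)
    grid = F.affine_grid(inv, maps.size(), align_corners=False)
    return F.grid_sample(maps, grid, align_corners=False)   # (N, 6, H, W)
```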
We demonstrate DeepMCBM's utility on a variety of videos, including ones beyond
the scope of other methods. Our code is available at
https://github.com/BGU-CS-VIL/DeepMCBM.