299 research outputs found
Predicting Future Instance Segmentation by Forecasting Convolutional Features
Anticipating future events is an important prerequisite towards intelligent
behavior. Video forecasting has been studied as a proxy task towards this goal.
Recent work has shown that to predict semantic segmentation of future frames,
forecasting at the semantic level is more effective than forecasting RGB frames
and then segmenting these. In this paper we consider the more challenging
problem of future instance segmentation, which additionally segments out
individual objects. To deal with a varying number of output labels per image,
we develop a predictive model in the space of fixed-sized convolutional
features of the Mask R-CNN instance segmentation model. We apply the "detection
head'" of Mask R-CNN on the predicted features to produce the instance
segmentation of future frames. Experiments show that this approach
significantly improves over strong baselines based on optical flow and
repurposed instance segmentation architectures
Predicting Deeper into the Future of Semantic Segmentation
The ability to predict and therefore to anticipate the future is an important
attribute of intelligence. It is also of utmost importance in real-time
systems, e.g. in robotics or autonomous driving, which depend on visual scene
understanding for decision making. While prediction of the raw RGB pixel values
in future video frames has been studied in previous work, here we introduce the
novel task of predicting semantic segmentations of future frames. Given a
sequence of video frames, our goal is to predict segmentation maps of not yet
observed video frames that lie up to a second or further in the future. We
develop an autoregressive convolutional neural network that learns to
iteratively generate multiple frames. Our results on the Cityscapes dataset
show that directly predicting future segmentations is substantially better than
predicting and then segmenting future RGB frames. Prediction results up to half
a second in the future are visually convincing and are much more accurate than
those of a baseline based on warping semantic segmentations using optical flow.Comment: Accepted to ICCV 2017. Supplementary material available on the
authors' webpage
Forecasting Future Instance Segmentation with Learned Optical Flow and Warping
For an autonomous vehicle it is essential to observe the ongoing dynamics of
a scene and consequently predict imminent future scenarios to ensure safety to
itself and others. This can be done using different sensors and modalities. In
this paper we investigate the usage of optical flow for predicting future
semantic segmentations. To do so we propose a model that forecasts flow fields
autoregressively. Such predictions are then used to guide the inference of a
learned warping function that moves instance segmentations on to future frames.
Results on the Cityscapes dataset demonstrate the effectiveness of optical-flow
methods.Comment: Paper published as Poster at ICIAP2
FLODCAST: Flow and Depth Forecasting via Multimodal Recurrent Architectures
Forecasting motion and spatial positions of objects is of fundamental
importance, especially in safety-critical settings such as autonomous driving.
In this work, we address the issue by forecasting two different modalities that
carry complementary information, namely optical flow and depth. To this end we
propose FLODCAST a flow and depth forecasting model that leverages a multitask
recurrent architecture, trained to jointly forecast both modalities at once. We
stress the importance of training using flows and depth maps together,
demonstrating that both tasks improve when the model is informed of the other
modality. We train the proposed model to also perform predictions for several
timesteps in the future. This provides better supervision and leads to more
precise predictions, retaining the capability of the model to yield outputs
autoregressively for any future time horizon. We test our model on the
challenging Cityscapes dataset, obtaining state of the art results for both
flow and depth forecasting. Thanks to the high quality of the generated flows,
we also report benefits on the downstream task of segmentation forecasting,
injecting our predictions in a flow-based mask-warping framework.Comment: Submitted to Pattern Recognitio
Deep learning cardiac motion analysis for human survival prediction
Motion analysis is used in computer vision to understand the behaviour of
moving objects in sequences of images. Optimising the interpretation of dynamic
biological systems requires accurate and precise motion tracking as well as
efficient representations of high-dimensional motion trajectories so that these
can be used for prediction tasks. Here we use image sequences of the heart,
acquired using cardiac magnetic resonance imaging, to create time-resolved
three-dimensional segmentations using a fully convolutional network trained on
anatomical shape priors. This dense motion model formed the input to a
supervised denoising autoencoder (4Dsurvival), which is a hybrid network
consisting of an autoencoder that learns a task-specific latent code
representation trained on observed outcome data, yielding a latent
representation optimised for survival prediction. To handle right-censored
survival outcomes, our network used a Cox partial likelihood loss function. In
a study of 302 patients the predictive accuracy (quantified by Harrell's
C-index) was significantly higher (p < .0001) for our model C=0.73 (95 CI:
0.68 - 0.78) than the human benchmark of C=0.59 (95 CI: 0.53 - 0.65). This
work demonstrates how a complex computer vision task using high-dimensional
medical image data can efficiently predict human survival
Panoptic segmentation forecasting
Our goal is to forecast the near future given a set of recent observations. We think this ability to forecast, i.e., to anticipate, is integral for the success of autonomous agents which need not only passively analyze an observation but also must react to it in real-time. Importantly, accurate forecasting hinges upon the chosen scene decomposition. We think that superior forecasting can be achieved by decomposing a dynamic scene into individual 'things' and background 'stuff'. Background 'stuff' largely moves because of camera motion, while foreground 'things' move because of both camera and individual object motion. Following this decomposition, we introduce panoptic segmentation forecasting. Panoptic segmentation forecasting opens up a middle-ground between existing extremes, which either forecast instance trajectories or predict the appearance of future image frames. To address this task we develop a two-component model: one component learns the dynamics of the background stuff by anticipating odometry, the other one anticipates the dynamics of detected things. We establish a leaderboard for this novel task, and validate a state-of-the-art model that outperforms available baselines
Learning Features by Watching Objects Move
This paper presents a novel yet intuitive approach to unsupervised feature
learning. Inspired by the human visual system, we explore whether low-level
motion-based grouping cues can be used to learn an effective visual
representation. Specifically, we use unsupervised motion-based segmentation on
videos to obtain segments, which we use as 'pseudo ground truth' to train a
convolutional network to segment objects from a single frame. Given the
extensive evidence that motion plays a key role in the development of the human
visual system, we hope that this straightforward approach to unsupervised
learning will be more effective than cleverly designed 'pretext' tasks studied
in the literature. Indeed, our extensive experiments show that this is the
case. When used for transfer learning on object detection, our representation
significantly outperforms previous unsupervised approaches across multiple
settings, especially when training data for the target task is scarce.Comment: CVPR 201
A Review on Deep Learning Techniques for Video Prediction
The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review on the deep learning methods for prediction in video sequences. We firstly define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied with experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper is summarized by drawing some general conclusions, identifying open research challenges and by pointing out future research directions.This work has been funded by the Spanish Government PID2019-104818RB-I00 grant for the MoDeaAS project, supported with Feder funds. This work has also been supported by two Spanish national grants for PhD studies, FPU17/00166, and ACIF/2018/197 respectively
- …