409 research outputs found
Point-wise mutual information-based video segmentation with high temporal consistency
In this paper, we tackle the problem of temporally consistent boundary
detection and hierarchical segmentation in videos. While finding the best
high-level reasoning of region assignments in videos is the focus of much
recent research, temporal consistency in boundary detection has so far only
rarely been tackled. We argue that temporally consistent boundaries are a key
component to temporally consistent region assignment. The proposed method is
based on the point-wise mutual information (PMI) of spatio-temporal voxels.
Temporal consistency is established by an evaluation of PMI-based point
affinities in the spectral domain over space and time. Thus, the proposed
method is independent of any optical flow computation or previously learned
motion models. The proposed low-level video segmentation method outperforms the
learning-based state of the art in terms of standard region metrics.
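The PMI affinity at the heart of this abstract can be illustrated with a minimal numpy sketch: estimate joint and marginal statistics from co-occurring (e.g. spatially or temporally adjacent) voxel features and score pairs by PMI_rho(a, b) = log(P(a,b)^rho / (P(a)P(b))). The binning, rho value, and toy data below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pmi_affinity(pairs, n_bins=16, rho=1.25, eps=1e-12):
    """Estimate PMI_rho(a, b) = log(P(a,b)^rho / (P(a) P(b))) from
    observed co-occurring feature pairs, quantized into n_bins bins.
    High PMI = the pair co-occurs more often than chance, i.e. the two
    points likely belong to the same region."""
    joint = np.zeros((n_bins, n_bins))
    for x, y in pairs:
        joint[x, y] += 1
        joint[y, x] += 1          # symmetric affinity
    joint /= joint.sum()          # joint probability P(a, b)
    marg = joint.sum(axis=1)      # marginal probability P(a)
    return np.log((joint ** rho + eps) / (np.outer(marg, marg) + eps))

# toy example: adjacent voxels tend to have nearly identical features,
# so near-diagonal bins get high PMI and distant bins get low PMI
rng = np.random.default_rng(0)
base = rng.integers(0, 16, size=1000)
neighbour = np.clip(base + rng.integers(-1, 2, size=1000), 0, 15)
pmi = pmi_affinity(np.stack([base, neighbour], axis=1))
```

Spectral segmentation then treats the (exponentiated) PMI scores as edge weights of an affinity graph; the paper extends this grouping over space and time, which is what removes the need for optical flow.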
CASENet: Deep Category-Aware Semantic Edge Detection
Boundary and edge cues are highly beneficial in improving a wide variety of
vision tasks such as semantic segmentation, object recognition, stereo, and
object proposal generation. Recently, the problem of edge detection has been
revisited and significant progress has been made with deep learning. While
classical edge detection is a challenging binary problem in itself, the
category-aware semantic edge detection by nature is an even more challenging
multi-label problem. We model the problem such that each edge pixel can be
associated with more than one class, since it may lie on contours or junctions
belonging to two or more semantic classes. To this end, we propose a novel
end-to-end deep semantic edge learning architecture based on ResNet and a new
skip-layer architecture where category-wise edge activations at the top
convolution layer share and are fused with the same set of bottom layer
features. We then propose a multi-label loss function to supervise the fused
activations. We show that our proposed architecture benefits this problem with
better performance, and we outperform the current state-of-the-art semantic
edge detection methods by a large margin on standard data sets such as SBD and
Cityscapes.
Comment: Accepted to CVPR 2017
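The multi-label supervision described above can be sketched with a reweighted multi-label sigmoid cross-entropy: each of the K class channels gets an independent binary loss per pixel, and rare positive (edge) labels are up-weighted. This is a minimal numpy sketch in the spirit of the loss the abstract describes; the exact reweighting scheme in CASENet may differ.

```python
import numpy as np

def multilabel_edge_loss(logits, labels, eps=1e-7):
    """Reweighted multi-label sigmoid cross-entropy.
    logits, labels: arrays of shape (K, H, W); labels in {0, 1},
    and a single pixel may be positive in several class channels.
    Edge pixels are rare, so positives are weighted by the fraction
    of negative labels (and vice versa) to balance the loss."""
    p = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid
    beta = 1.0 - labels.mean()          # fraction of negative labels
    loss = -(beta * labels * np.log(p + eps)
             + (1.0 - beta) * (1.0 - labels) * np.log(1.0 - p + eps))
    return loss.mean()

# a pixel on a junction can be positive for two classes at once
labels = np.zeros((3, 4, 4))
labels[0, 1, 1] = 1.0
labels[1, 1, 1] = 1.0
good_logits = np.where(labels == 1.0, 8.0, -8.0)   # confident, correct
bad_logits = -good_logits                          # confident, wrong
```

Because the K channels are supervised independently (sigmoid, not softmax), the classes are not mutually exclusive, which is exactly what the multi-label formulation requires.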
Virtual Occlusions Through Implicit Depth
For augmented reality (AR), it is important that virtual assets appear to 'sit among' real world objects. The virtual element should variously occlude and be occluded by real matter, based on a plausible depth ordering. This occlusion should be consistent over time as the viewer's camera moves. Unfortunately, small mistakes in the estimated scene depth can ruin the downstream occlusion mask, and thereby the AR illusion. Especially in real-time settings, depths inferred near boundaries or across time can be inconsistent. In this paper, we challenge the need for depth regression as an intermediate step. We instead propose an implicit model for depth and use that to predict the occlusion mask directly. The inputs to our network are one or more color images, plus the known depths of any virtual geometry. We show how our occlusion predictions are more accurate and more temporally stable than predictions derived from traditional depth-estimation models. We obtain state-of-the-art occlusion results on the challenging ScanNetv2 dataset and superior qualitative results on real scenes.
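The depth-then-threshold baseline that this abstract argues against is easy to state in code, which also makes the failure mode clear: a small depth error near a boundary flips the binary test and flickers the composite. A minimal sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def depth_test_mask(real_depth, virtual_depth):
    """Classic depth-test compositing: the virtual asset is visible
    wherever it is nearer to the camera than the estimated real scene.
    Small errors in real_depth flip this binary test per pixel, which
    is why the paper predicts the occlusion mask directly instead."""
    return virtual_depth < real_depth   # True where the virtual pixel shows

def composite(real_rgb, virtual_rgb, real_depth, virtual_depth):
    """Blend the rendered virtual layer into the camera frame."""
    mask = depth_test_mask(real_depth, virtual_depth)
    return np.where(mask[..., None], virtual_rgb, real_rgb)

# toy 2x2 frame: real surface at depth 2.0, virtual asset at 1.0 (in
# front, visible) on the left and 3.0 (behind, occluded) on the right
real_depth = np.full((2, 2), 2.0)
virtual_depth = np.array([[1.0, 3.0], [1.0, 3.0]])
out = composite(np.zeros((2, 2, 3)), np.ones((2, 2, 3)),
                real_depth, virtual_depth)
```

The paper's contribution is to replace `depth_test_mask` with a network that maps color images plus the known virtual depth straight to the mask, skipping explicit regression of `real_depth`.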
Real-time factored ConvNets: Extracting the x factor in human parsing
© 2017. The copyright of this document resides with its authors. We propose a real-time and lightweight multi-task-style ConvNet (termed a Factored ConvNet) for human body parsing in images or video. Factored ConvNets have isolated areas which perform known sub-tasks, such as object localization or edge detection. We call this area-and-sub-task pair an X factor. Unlike multi-task ConvNets, which have independent tasks, the Factored ConvNet's sub-task has a direct effect on the main task outcome. In this paper we show how to isolate the X factor of foreground/background (f/b) subtraction from the main task of segmenting human body images into 31 different body part types. Knowledge of this X factor leads to a number of benefits for the Factored ConvNet: 1) ease of network transfer to other image domains, 2) the ability to personalize to humans in video, and 3) easy model performance boosts. All are achieved by efficient update or replacement of the X factor whilst avoiding catastrophic forgetting of previously learnt body part dependencies and structure. We show these benefits on a large dataset of images and also on YouTube videos.
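The structural idea of an X factor, an isolated sub-module whose output feeds the main task and which can be updated or swapped on its own, can be sketched as follows. All names here are hypothetical illustrations of the composition pattern, not the paper's architecture.

```python
import numpy as np

def factored_parse(image, fb_module, parse_module):
    """Hypothetical factored pipeline: the f/b 'X factor' is an
    isolated sub-network producing a foreground mask, which is fed
    into the main body-part parser. Because the two stages are
    separated, fb_module can be fine-tuned or replaced (e.g. to
    personalize to one person, or to transfer to a new image domain)
    without retraining parse_module."""
    fb_mask = fb_module(image)                              # (H, W) in [0, 1]
    features = np.concatenate([image, fb_mask[..., None]], axis=-1)
    return parse_module(features)                           # (H, W) labels

# stand-in modules: a brightness-threshold f/b mask and a trivial parser
image = np.random.default_rng(1).random((8, 8, 3))
fb_module = lambda img: (img.mean(axis=-1) > 0.5).astype(float)
parse_module = lambda feats: feats.argmax(axis=-1)
parts = factored_parse(image, fb_module, parse_module)
```

Swapping `fb_module` for a domain-specific one while keeping `parse_module` fixed is the mechanism behind the transfer and personalization benefits the abstract lists, since the main task's learnt body-part structure is left untouched.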
- …