2,316 research outputs found
GASP : Geometric Association with Surface Patches
A fundamental challenge to sensory processing tasks in perception and
robotics is the problem of obtaining data associations across views. We present
a robust solution for ascertaining potentially dense surface patch (superpixel)
associations, requiring just range information. Our approach involves
decomposition of a view into regularized surface patches. We represent them as
sequences expressing geometry invariantly over their superpixel neighborhoods,
as uniquely consistent partial orderings. We match these representations
through an optimal sequence comparison metric based on the Damerau-Levenshtein
distance - enabling robust association with quadratic complexity (in contrast
to hitherto employed joint matching formulations which are NP-complete). The
approach is able to perform under wide baselines, heavy rotations, partial
overlaps, significant occlusions and sensor noise.
The technique does not require any priors -- motion or otherwise, and does
not make restrictive assumptions on scene structure and sensor movement. It
does not require appearance -- is hence more widely applicable than appearance
reliant methods, and invulnerable to related ambiguities such as textureless or
aliased content. We present promising qualitative and quantitative results
under diverse settings, along with comparatives with popular approaches based
on range as well as RGB-D data.Comment: International Conference on 3D Vision, 201
Video object segmentation introducing depth and motion information
We present a method to estimate the relative depth between objects in scenes of video sequences. The information for the estimation of the relative depth is obtained from the overlapping produced between objects when there is relative motion as well as from motion coherence between neighbouring regions. A relaxation labelling algorithm is used to solve conflicts and assign every region to a depth level. The depth estimation is used in a segmentation scheme which uses grey level information to produce a first segmentation. Regions of this partition are merged on the basis of their depth level.Peer ReviewedPostprint (published version
Single-Shot Multi-Person 3D Pose Estimation From Monocular RGB
We propose a new single-shot method for multi-person 3D pose estimation in
general scenes from a monocular RGB camera. Our approach uses novel
occlusion-robust pose-maps (ORPM) which enable full body pose inference even
under strong partial occlusions by other people and objects in the scene. ORPM
outputs a fixed number of maps which encode the 3D joint locations of all
people in the scene. Body part associations allow us to infer 3D pose for an
arbitrary number of people without explicit bounding box prediction. To train
our approach we introduce MuCo-3DHP, the first large scale training data set
showing real images of sophisticated multi-person interactions and occlusions.
We synthesize a large corpus of multi-person images by compositing images of
individual people (with ground truth from mutli-view performance capture). We
evaluate our method on our new challenging 3D annotated multi-person test set
MuPoTs-3D where we achieve state-of-the-art performance. To further stimulate
research in multi-person 3D pose estimation, we will make our new datasets, and
associated code publicly available for research purposes.Comment: International Conference on 3D Vision (3DV), 201
Occlusion resistant learning of intuitive physics from videos
To reach human performance on complex tasks, a key ability for artificial
systems is to understand physical interactions between objects, and predict
future outcomes of a situation. This ability, often referred to as intuitive
physics, has recently received attention and several methods were proposed to
learn these physical rules from video sequences. Yet, most of these methods are
restricted to the case where no, or only limited, occlusions occur. In this
work we propose a probabilistic formulation of learning intuitive physics in 3D
scenes with significant inter-object occlusions. In our formulation, object
positions are modeled as latent variables enabling the reconstruction of the
scene. We then propose a series of approximations that make this problem
tractable. Object proposals are linked across frames using a combination of a
recurrent interaction network, modeling the physics in object space, and a
compositional renderer, modeling the way in which objects project onto pixel
space. We demonstrate significant improvements over state-of-the-art in the
intuitive physics benchmark of IntPhys. We apply our method to a second dataset
with increasing levels of occlusions, showing it realistically predicts
segmentation masks up to 30 frames in the future. Finally, we also show results
on predicting motion of objects in real videos
Learning to Extract a Video Sequence from a Single Motion-Blurred Image
We present a method to extract a video sequence from a single motion-blurred
image. Motion-blurred images are the result of an averaging process, where
instant frames are accumulated over time during the exposure of the sensor.
Unfortunately, reversing this process is nontrivial. Firstly, averaging
destroys the temporal ordering of the frames. Secondly, the recovery of a
single frame is a blind deconvolution task, which is highly ill-posed. We
present a deep learning scheme that gradually reconstructs a temporal ordering
by sequentially extracting pairs of frames. Our main contribution is to
introduce loss functions invariant to the temporal order. This lets a neural
network choose during training what frame to output among the possible
combinations. We also address the ill-posedness of deblurring by designing a
network with a large receptive field and implemented via resampling to achieve
a higher computational efficiency. Our proposed method can successfully
retrieve sharp image sequences from a single motion blurred image and can
generalize well on synthetic and real datasets captured with different cameras
The Video Mesh: A Data Structure for Image-based Video Editing
This paper introduces the video mesh, a data structure for representing video as 2.5D "paper cutouts." The video mesh allows interactive editing of moving objects and modeling of depth, which enables 3D effects and post-exposure camera control. The video mesh sparsely encodes optical flow as well as depth, and handles occlusion using local layering and alpha mattes. Motion is described by a sparse set of points tracked over time. Each point also stores a depth value. The video mesh is a triangulation over this point set and per-pixel information is obtained by interpolation. The user rotoscopes occluding contours and we introduce an algorithm to cut the video mesh along them. Object boundaries are refined with perpixel alpha values. The video mesh is at its core a set of texture mapped triangles, we leverage graphics hardware to enable interactive editing and rendering of a variety of effects. We demonstrate the effectiveness of our representation with a number of special effects including 3D viewpoint changes, object insertion, and depth-of-field manipulation
Pose Estimation and Segmentation of Multiple People in Stereoscopic Movies
International audienceWe describe a method to obtain a pixel-wise segmentation and pose estimation of multiple people in stereoscopic videos. This task involves challenges such as dealing with unconstrained stereoscopic video, non-stationary cameras, and complex indoor and outdoor dynamic scenes with multiple people. We cast the problem as a discrete labelling task involving multiple person labels, devise a suitable cost function, and optimize it efficiently. The contributions of our work are two-fold: First, we develop a segmentation model incorporating person detections and learnt articulated pose segmentation masks, as well as colour, motion, and stereo disparity cues. The model also explicitly represents depth ordering and occlusion. Second, we introduce a stereoscopic dataset with frames extracted from feature-length movies "StreetDance 3D" and "Pina". The dataset contains 587 annotated human poses, 1158 bounding box annotations and 686 pixel-wise segmentations of people. The dataset is composed of indoor and outdoor scenes depicting multiple people with frequent occlusions. We demonstrate results on our new challenging dataset, as well as on the H2view dataset from (Sheasby et al. ACCV 2012)
- …