LookinGood: Enhancing Performance Capture with Real-time Neural Re-Rendering
Motivated by augmented and virtual reality applications such as telepresence,
there has been a recent focus on real-time performance capture of humans in
motion. However, given the real-time constraint, these systems often suffer
from artifacts in geometry and texture such as holes and noise in the final
rendering, poor lighting, and low-resolution textures. We take the novel
approach to augment such real-time performance capture systems with a deep
architecture that takes a rendering from an arbitrary viewpoint, and jointly
performs completion, super resolution, and denoising of the imagery in
real-time. We call this approach neural (re-)rendering, and our live system
"LookinGood". Our deep architecture is trained to produce high resolution and
high quality images from a coarse rendering in real-time. First, we propose a
self-supervised training method that does not require manual ground-truth
annotation. We contribute a specialized reconstruction error that uses semantic
information to focus on relevant parts of the subject, e.g. the face. We also
introduce a salient reweighing scheme of the loss function that is able to
discard outliers. We specifically design the system for virtual and augmented
reality headsets where the consistency between the left and right eye plays a
crucial role in the final user experience. Finally, we generate temporally
stable results by explicitly minimizing the difference between two consecutive
frames. We tested the proposed system in two different scenarios: one involving
a single RGB-D sensor and upper body reconstruction of an actor, and the second
consisting of full body 360 degree capture. Through extensive experimentation,
we demonstrate how our system generalizes across unseen sequences and subjects.
The supplementary video is available at http://youtu.be/Md3tdAKoLGU.
Comment: To be presented at SIGGRAPH Asia 2018.
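As a rough illustration of the kind of objective the abstract describes (a semantic-weighted reconstruction term plus a temporal-stability term between consecutive frames), below is a minimal PyTorch sketch; the weighting scheme, loss balance, and function names are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def rerender_loss(pred, target, semantic_weight, prev_pred, prev_target,
                  lambda_temporal=0.1):
    """Hypothetical training loss for neural re-rendering.

    pred, target           : (B, 3, H, W) current predicted / ground-truth frames
    semantic_weight        : (B, 1, H, W) per-pixel weights emphasizing regions
                             such as the face (assumed precomputed from a
                             semantic segmentation)
    prev_pred, prev_target : previous frame pair, used for temporal stability
    """
    # Semantic-weighted L1 reconstruction: focus the loss on relevant parts.
    recon = (semantic_weight * (pred - target).abs()).mean()

    # Temporal term: the change between consecutive predictions should match
    # the change between consecutive ground-truth frames.
    temporal = F.l1_loss(pred - prev_pred, target - prev_target)

    return recon + lambda_temporal * temporal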
Volumetric Capture of Humans with a Single RGBD Camera via Semi-Parametric Learning
Volumetric (4D) performance capture is fundamental for AR/VR content
generation. Whereas previous work in 4D performance capture has shown
impressive results in studio settings, the technology is still far from being
accessible to a typical consumer who, at best, might own a single RGBD sensor.
Thus, in this work, we propose a method to synthesize free viewpoint renderings
using a single RGBD camera. The key insight is to leverage previously seen
"calibration" images of a given user to extrapolate what should be rendered in
a novel viewpoint from the data available in the sensor. Given these past
observations from multiple viewpoints, and the current RGBD image from a fixed
view, we propose an end-to-end framework that fuses both these data sources to
generate novel renderings of the performer. We demonstrate that the method can
produce high fidelity images, and handle extreme changes in subject pose and
camera viewpoints. We also show that the system generalizes to performers not
seen in the training data. We run exhaustive experiments demonstrating the
effectiveness of the proposed semi-parametric model (i.e. calibration images
available to the neural network) compared to other state-of-the-art
machine-learned solutions. Further, we compare the method with more traditional
pipelines that employ multi-view capture. We show that our framework is able to
achieve compelling results, with substantially less infrastructure than
previously required.
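A minimal sketch of the semi-parametric idea described above: fuse previously captured "calibration" views of the user with the current RGBD frame to render the performer from a novel viewpoint. Channel counts, the pooling over calibration views, and the overall module layout are illustrative assumptions rather than the paper's architecture.

import torch
import torch.nn as nn

class SemiParametricRenderer(nn.Module):
    """Hypothetical fusion of calibration images and the current RGBD view."""

    def __init__(self, feat=32):
        super().__init__()
        # Shared encoder for RGB calibration images.
        self.calib_enc = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Encoder for the current RGBD input (4 channels).
        self.rgbd_enc = nn.Sequential(
            nn.Conv2d(4, feat, 3, padding=1), nn.ReLU())
        # Decoder fuses pooled calibration features with the current view.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat * 2, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, rgbd, calib_images):
        # calib_images: (B, N, 3, H, W) previously observed viewpoints.
        b, n, c, h, w = calib_images.shape
        calib_feat = self.calib_enc(calib_images.view(b * n, c, h, w))
        calib_feat = calib_feat.view(b, n, -1, h, w).mean(dim=1)  # pool over views
        cur_feat = self.rgbd_enc(rgbd)
        return self.decoder(torch.cat([calib_feat, cur_feat], dim=1))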
Neural Rerendering in the Wild
We explore total scene capture -- recording, modeling, and rerendering a
scene under varying appearance such as season and time of day. Starting from
internet photos of a tourist landmark, we apply traditional 3D reconstruction
to register the photos and approximate the scene as a point cloud. For each
photo, we render the scene points into a deep framebuffer, and train a neural
network to learn the mapping of these initial renderings to the actual photos.
This rerendering network also takes as input a latent appearance vector and a
semantic mask indicating the location of transient objects like pedestrians.
The model is evaluated on several datasets of publicly available images
spanning a broad range of illumination conditions. We create short videos
demonstrating realistic manipulation of the image viewpoint, appearance, and
semantic labeling. We also compare results with prior work on scene
reconstruction from internet photos.
Comment: To be presented at CVPR 2019 (oral). Supplementary video available at
http://youtu.be/E1crWQn_km
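A minimal sketch of a rerendering network of the kind described above: it maps a rendered point-cloud buffer, a transient-object mask, and a broadcast appearance latent to an output photo. The channel sizes and the way the latent is injected are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class RerenderNet(nn.Module):
    """Hypothetical rerendering network conditioned on appearance and semantics."""

    def __init__(self, buffer_ch=4, latent_dim=8, feat=64):
        super().__init__()
        # Input: rendered buffer + semantic mask + broadcast appearance latent.
        self.net = nn.Sequential(
            nn.Conv2d(buffer_ch + 1 + latent_dim, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, rendered, mask, appearance):
        # rendered: (B, buffer_ch, H, W); mask: (B, 1, H, W); appearance: (B, latent_dim)
        b, _, h, w = rendered.shape
        z = appearance.view(b, -1, 1, 1).expand(b, appearance.shape[1], h, w)
        return self.net(torch.cat([rendered, mask, z], dim=1))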
Deep Learning for Single-View Instance Recognition
Deep learning methods have typically been trained on large datasets in which
many training examples are available. However, many real-world product datasets
have only a small number of images available for each product. We explore the
use of deep learning methods for recognizing object instances when we have only
a single training example per class. We show that feedforward neural networks
outperform state-of-the-art methods for recognizing objects from novel
viewpoints even when trained from just a single image per object. To further
improve our performance on this task, we propose to take advantage of a
supplementary dataset in which we observe a separate set of objects from
multiple viewpoints. We introduce a new approach for training deep learning
methods for instance recognition with limited training data, in which we use an
auxiliary multi-view dataset to train our network to be robust to viewpoint
changes. We find that this approach leads to a more robust classifier for
recognizing objects from novel viewpoints, outperforming previous
state-of-the-art approaches including keypoint-matching, template-based
techniques, and sparse coding.
Comment: 16 pages, 15 figures
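A minimal sketch of how an auxiliary multi-view dataset could be used to encourage viewpoint robustness while classifying single-example instances; the specific invariance term and the split into an embedding network and a classifier are assumptions, not the paper's exact training objective.

import torch
import torch.nn.functional as F

def viewpoint_robust_loss(embed_fn, classifier, target_img, target_label,
                          aux_view_a, aux_view_b):
    """Hypothetical combined objective. 'embed_fn' and 'classifier' are
    assumed user-supplied networks; the paper's loss may differ."""
    # Standard classification on the single-example-per-class target data.
    cls_loss = F.cross_entropy(classifier(embed_fn(target_img)), target_label)

    # Viewpoint-invariance term on the auxiliary multi-view data:
    # two views of the same auxiliary object should embed close together.
    ea = F.normalize(embed_fn(aux_view_a), dim=1)
    eb = F.normalize(embed_fn(aux_view_b), dim=1)
    inv_loss = (1.0 - (ea * eb).sum(dim=1)).clamp(min=0).mean()

    return cls_loss + inv_loss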
Lifting Object Detection Datasets into 3D
While data has certainly taken the center stage in computer vision in recent
years, it can still be difficult to obtain in certain scenarios. In particular,
acquiring ground truth 3D shapes of objects pictured in 2D images remains a
challenging feat and this has hampered progress in recognition-based object
reconstruction from a single image. Here we propose to bypass previous
solutions such as 3D scanning or manual design, that scale poorly, and instead
populate object category detection datasets semi-automatically with dense,
per-object 3D reconstructions, bootstrapped from: (i) class labels, (ii) ground
truth figure-ground segmentations, and (iii) a small set of keypoint
annotations. Our proposed algorithm first estimates camera viewpoint using
rigid structure-from-motion and then reconstructs object shapes by optimizing
over visual hull proposals guided by loose within-class shape similarity
assumptions. The visual hull sampling process attempts to intersect an object's
projection cone with the cones of minimal subsets of other similar objects
among those pictured from certain vantage points. We show that our method is
able to produce convincing per-object 3D reconstructions and to accurately
estimate camera viewpoints on one of the most challenging existing
object-category detection datasets, PASCAL VOC. We hope that our results will
re-stimulate interest in joint object recognition and 3D reconstruction from a
single image.
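A minimal NumPy sketch of visual-hull carving, the core operation behind the cone-intersection step described above: a candidate 3D point is kept only if it projects inside the silhouette of every contributing view. The paper's proposal sampling over minimal subsets of similar objects is not reproduced here.

import numpy as np

def visual_hull(voxels, cameras, masks):
    """Illustrative visual-hull intersection (an assumed implementation).

    voxels : (N, 3) candidate 3D points
    cameras: list of (3, 4) projection matrices for the contributing views
    masks  : list of (H, W) boolean figure-ground segmentations
    """
    keep = np.ones(len(voxels), dtype=bool)
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])       # (N, 4)
    for P, mask in zip(cameras, masks):
        proj = homog @ P.T                                        # (N, 3)
        uv = (proj[:, :2] / proj[:, 2:3]).round().astype(int)     # pixel coords
        h, w = mask.shape
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = mask[uv[inside, 1], uv[inside, 0]]
        keep &= hit                                               # intersect projection cones
    return voxels[keep]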
Neural Allocentric Intuitive Physics Prediction from Real Videos
Humans are able to make rich predictions about the future dynamics of
physical objects from a glance. On the other hand, most existing computer
vision approaches require strong assumptions about the underlying system,
ad-hoc modeling, or annotated datasets, to carry out even simple predictions.
To tackle this gap, we propose a new perspective on the problem of learning
intuitive physics that is inspired by the spatial memory representation of
objects and spaces in human brains, in particular the co-existence of
egocentric and allocentric spatial representations. We present a generic
framework that learns a layered representation of the physical world, using a
cascade of invertible modules. In this framework, real images are first
converted to a synthetic domain representation that reduces complexity arising
from lighting and texture. Then, an allocentric viewpoint transformer removes
viewpoint complexity by projecting images to a canonical view. Finally, a novel
Recurrent Latent Variation Network (RLVN) architecture learns the dynamics of
the objects interacting with the environment and predicts future motion,
leveraging the availability of unlimited synthetic simulations. Predicted
frames are then projected back to the original camera view and translated back
to the real world domain. Experimental results show the ability of the
framework to consistently and accurately predict several frames in the future
and the ability to adapt to real images.
Comment: Added references, minor changes. arXiv admin note: text overlap with
arXiv:1506.02025 by other authors
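A schematic sketch of the layered pipeline the abstract describes (real-to-synthetic translation, allocentric canonicalization, learned dynamics, and the inverse mappings). Each argument is an assumed trained callable, and the names are illustrative rather than the paper's API.

def predict_future_frames(frame, real2syn, to_allocentric, dynamics,
                          from_allocentric, syn2real, n_steps=5):
    """Hypothetical composition of the invertible modules described above."""
    syn = real2syn(frame)                   # real image -> synthetic domain
    state = to_allocentric(syn)             # remove viewpoint: canonical view
    predictions = []
    for _ in range(n_steps):
        state = dynamics(state)             # roll the learned dynamics forward
        view = from_allocentric(state)      # back to the original camera view
        predictions.append(syn2real(view))  # back to the real-image domain
    return predictions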
Orientation Driven Bag of Appearances for Person Re-identification
Person re-identification (re-id) consists of associating individuals across a
camera network, which is valuable for intelligent video surveillance and has
drawn wide attention. Although person re-identification research is making
progress, it still faces some challenges such as varying poses, illumination
and viewpoints. For feature representation in re-identification, existing works
usually use low-level descriptors which do not take full advantage of body
structure information, resulting in low representation ability. To solve this
problem, this paper proposes the mid-level
body-structure based feature representation (BSFR), which introduces a body
structure pyramid for codebook learning and feature pooling in the vertical
direction of human body. Besides, varying viewpoints in the horizontal
direction of the human body usually cause a data-missing problem, i.e., the
appearances obtained in different orientations of the same person can vary
significantly. To address this problem, the orientation driven bag of
appearances (ODBoA) is proposed to utilize person orientation information
extracted by an orientation estimation technique. To properly evaluate the proposed
approach, we introduce two new re-identification datasets: Market-1203, based on
the Market-1501 dataset, and PKU-Reid.
Both datasets contain multiple images captured in different body orientations
for each person. Experimental results on three public datasets and two proposed
datasets demonstrate the superiority of the proposed approach, indicating the
effectiveness of body structure and orientation information for improving
re-identification performance.
Comment: 13 pages, 15 figures, 3 tables, submitted to IEEE Transactions on
Circuits and Systems for Video Technology
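A minimal sketch of orientation-driven pooling and matching in the spirit of ODBoA: patch codes are pooled into the bin of the estimated body orientation, and two signatures are compared only on orientation bins both have observed. The binning, pooling, and scoring choices here are assumptions, not the paper's formulation.

import numpy as np

def pooled_signature(patch_codes, orientation_deg, n_bins=8):
    """Accumulate an (N, D) matrix of codebook responses into the bin of the
    person's estimated body orientation (in degrees)."""
    signature = np.zeros((n_bins, patch_codes.shape[1]))
    bin_idx = int(orientation_deg // (360.0 / n_bins)) % n_bins
    signature[bin_idx] = patch_codes.max(axis=0)   # max-pool codes into that bin
    return signature

def match_score(sig_a, sig_b):
    """Compare two signatures only on orientation bins both have observed."""
    shared = sig_a.any(axis=1) & sig_b.any(axis=1)
    if not shared.any():
        return 0.0
    a, b = sig_a[shared].ravel(), sig_b[shared].ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))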
Multi-View Surveillance Video Summarization via Joint Embedding and Sparse Optimization
Most traditional video summarization methods are designed to generate
effective summaries for single-view videos, and thus they cannot fully exploit
the complicated intra- and inter-view correlations in summarizing multi-view
videos in a camera network. In this paper, with the aim of summarizing
multi-view videos, we introduce a novel unsupervised framework via joint
embedding and sparse representative selection. The objective function is
two-fold. The first is to capture the multi-view correlations via an embedding,
which helps in extracting a diverse set of representatives. The second is to
use an ℓ2,1-norm to model the sparsity while selecting representative shots for
the summary. We propose to jointly optimize both of the objectives, such that
embedding can not only characterize the correlations, but also indicate the
requirements of sparse representative selection. We present an efficient
alternating algorithm based on half-quadratic minimization to solve the
proposed non-smooth and non-convex objective with convergence analysis. A key
advantage of the proposed approach with respect to the state-of-the-art is that
it can summarize multi-view videos without assuming any prior
correspondences/alignment between them, e.g., uncalibrated camera networks.
Rigorous experiments on several multi-view datasets demonstrate that our
approach clearly outperforms the state-of-the-art methods.
Comment: IEEE Trans. on Multimedia, 2017 (In Press)
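To make the sparse-selection part of the objective concrete, here is a minimal NumPy sketch of ℓ2,1-regularized representative selection solved by proximal gradient descent; the paper's full method additionally learns the joint embedding and uses half-quadratic minimization, which are not reproduced here.

import numpy as np

def select_representatives(Y, lam=1.0, n_iter=200):
    """Illustrative objective:  min_Z 0.5*||Y - Y Z||_F^2 + lam*||Z||_{2,1}.

    Y is (d, n) with one column per video shot; shots whose rows of Z have
    large norm act as representatives for the summary."""
    n = Y.shape[1]
    Z = np.zeros((n, n))
    G = Y.T @ Y
    step = 1.0 / (np.linalg.norm(G, 2) + 1e-8)        # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = G @ Z - G                               # gradient of the quadratic term
        Z = Z - step * grad
        # Row-wise soft thresholding: proximal operator of the l2,1 norm.
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        Z = Z * np.maximum(0.0, 1.0 - step * lam / (norms + 1e-12))
    row_norms = np.linalg.norm(Z, axis=1)
    return np.argsort(-row_norms), Z                   # shots ranked by importance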
Cooking in the kitchen: Recognizing and Segmenting Human Activities in Videos
As research on action recognition matures, the focus is shifting away from
categorizing basic task-oriented actions using hand-segmented video datasets to
understanding complex goal-oriented daily human activities in real-world
settings. Temporally structured models would seem an obvious choice for this set
of problems, but so far, cases where these models have outperformed simpler
unstructured bag-of-word types of models are scarce. With the increasing
availability of large human activity datasets, combined with the development of
novel feature coding techniques that yield more compact representations, it is
time to revisit structured generative approaches.
Here, we describe an end-to-end generative approach from the encoding of
features to the structural modeling of complex human activities by applying
Fisher vectors and temporal models for the analysis of video sequences.
We systematically evaluate the proposed approach on several available
datasets (ADL, MPIICooking, and Breakfast datasets) using a variety of
performance metrics. Through extensive system evaluations, we demonstrate that
combining compact video representations based on Fisher vectors with HMM-based
modeling yields very significant gains in accuracy, and that, when properly
trained with sufficient training samples, structured temporal models outperform
unstructured bag-of-word types of models by a large margin on the tested
performance metric.
Comment: 15 pages, 12 figures
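A simplified sketch of Fisher-vector encoding (gradients with respect to the GMM means only), the feature-coding step described above; the resulting clip-level vectors would then feed the HMM-based temporal model. The vocabulary size and the normalization follow common practice and are not taken from the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Mean-gradient Fisher vector for a fitted GaussianMixture with
    covariance_type='diag'; the full encoding also includes variance terms."""
    q = gmm.predict_proba(descriptors)                 # (T, K) soft assignments
    T = descriptors.shape[0]
    sigma = np.sqrt(gmm.covariances_)                  # (K, D) diagonal std devs
    fv = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / sigma[k]          # (T, D)
        grad_mu = (q[:, k:k + 1] * diff).sum(axis=0)
        grad_mu /= (T * np.sqrt(gmm.weights_[k]) + 1e-12)
        fv.append(grad_mu)
    fv = np.concatenate(fv)
    # Power and L2 normalization, as is standard for Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Usage sketch: fit the vocabulary on local video features, then encode clips.
# gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(train_descriptors)
# clip_fv = fisher_vector(clip_descriptors, gmm)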
Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos
Automatic saliency prediction in 360° videos is critical for viewpoint
guidance applications (e.g., Facebook 360 Guide). We propose a spatial-temporal
network which is (1) trained with weak supervision and (2) tailor-made for the
360° viewing sphere. Note that most existing methods are less scalable
since they rely on annotated saliency map for training. Most importantly, they
convert the 360° sphere to 2D images (e.g., a single equirectangular image or
multiple separate Normal Field-of-View (NFoV) images) which introduces
distortion and image boundaries. In contrast, we propose a simple and effective
Cube Padding (CP) technique as follows. Firstly, we render the 360° view
on six faces of a cube using perspective projection. Thus, it introduces very
little distortion. Then, we concatenate all six faces while utilizing the
connectivity between faces on the cube for image padding (i.e., Cube Padding)
in convolution, pooling, and convolutional LSTM layers. In this way, CP introduces
no image boundary while being applicable to almost all Convolutional Neural
Network (CNN) structures. To evaluate our method, we propose Wild-360, a new
360° video saliency dataset, containing challenging videos with saliency
heatmap annotations. In experiments, our method outperforms baseline methods in
both speed and quality.
Comment: CVPR 2018
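A simplified PyTorch sketch of the cube-padding idea: each face's border is filled from its adjacent cube faces rather than with zeros, so convolutions see no artificial image boundary. The face ordering, the neighbor table, and the omission of strip rotations/flips are simplifying assumptions; a faithful implementation must align face orientations.

import torch

def cube_pad(faces, pad=1):
    """Pad (6, C, H, W) per-face feature maps using neighboring cube faces."""
    # Illustrative neighbor table: for each face, the faces touching its
    # (left, right, top, bottom) edges.
    neighbors = {0: (3, 4, 5, 1), 1: (3, 4, 0, 2), 2: (3, 4, 1, 5),
                 3: (2, 0, 5, 1), 4: (0, 2, 5, 1), 5: (3, 4, 2, 0)}
    _, C, H, W = faces.shape
    out = faces.new_zeros(6, C, H + 2 * pad, W + 2 * pad)
    out[:, :, pad:pad + H, pad:pad + W] = faces
    for f, (lf, rf, tf, bf) in neighbors.items():
        out[f, :, pad:pad + H, :pad] = faces[lf, :, :, -pad:]    # left edge
        out[f, :, pad:pad + H, -pad:] = faces[rf, :, :, :pad]    # right edge
        out[f, :, :pad, pad:pad + W] = faces[tf, :, -pad:, :]    # top edge
        out[f, :, -pad:, pad:pad + W] = faces[bf, :, :pad, :]    # bottom edge
    return out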