High Resolution 3D Shape Texture from Multiple Videos
We examine the problem of retrieving high resolution textures of objects observed in multiple videos under small object deformations. In the monocular case, the data redundancy necessary to reconstruct a high-resolution image stems from temporal accumulation. This has been extensively explored and is known as super-resolution. On the other hand, a handful of methods have considered the texture of a static 3D object observed from several cameras, where the data redundancy is obtained through the different viewpoints. We introduce a unified framework to leverage both possibilities for the estimation of a high resolution texture of an object. This framework uniformly deals with any related geometric variability introduced by the acquisition chain or by the evolution over time. To this end, we use 2D warps for all viewpoints and all temporal frames and a linear projection model from texture to image space. Despite its simplicity, the method is able to successfully handle different views over space and time. Experiments show that temporal information improves the texture quality. We also show that our method outperforms state-of-the-art multi-view super-resolution methods that exist for the static case.
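As a rough illustration of what a linear projection model from texture to image space can look like, the sketch below uses a 1D toy setting. It is not the authors' implementation: the box-filter downsampling, the integer shifts standing in for warps, and the names `projection_matrix` and `estimate_texture` are all assumptions made for this example. Each observation is modeled as a matrix applied to the unknown texture, and the observations from all views and frames are stacked into one least-squares problem.

```python
import numpy as np

def projection_matrix(n_hi, factor, shift):
    """Hypothetical linear map from a 1D high-res texture to one low-res view.

    Each low-res pixel averages `factor` consecutive texels, starting at
    offset `shift` (the shift stands in for a per-view warp).
    """
    n_lo = (n_hi - shift) // factor
    A = np.zeros((n_lo, n_hi))
    for i in range(n_lo):
        start = shift + i * factor
        A[i, start:start + factor] = 1.0 / factor
    return A

def estimate_texture(observations, matrices):
    """Stack all views and solve the joint least-squares problem."""
    A = np.vstack(matrices)
    y = np.concatenate(observations)
    # lstsq returns the minimum-norm solution when the system is
    # underdetermined, which is the case in this small example.
    x, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x

# Toy data: one texture seen through two shifted, downsampled views.
rng = np.random.default_rng(0)
texture = rng.random(16)
mats = [projection_matrix(16, 2, s) for s in (0, 1)]
obs = [A @ texture for A in mats]
estimate = estimate_texture(obs, mats)
```

With only two views the stacked system has fewer equations than unknowns, so the estimate reproduces the observations without necessarily matching the true texture. Adding more views or frames adds rows to the system, which is the redundancy the abstract refers to.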
3D Face Tracking and Texture Fusion in the Wild
We present a fully automatic approach to real-time 3D face reconstruction
from monocular in-the-wild videos. With the use of a cascaded-regressor based
face tracking and a 3D Morphable Face Model shape fitting, we obtain a
semi-dense 3D face shape. We further use the texture information from multiple
frames to build a holistic 3D face representation from the video frames. Our
system is able to capture facial expressions and does not require any
person-specific training. We demonstrate the robustness of our approach on the
challenging 300 Videos in the Wild (300-VW) dataset. Our real-time fitting
framework is available as an open source library at http://4dface.org
Learning from Synthetic Humans
Estimating human pose, shape, and motion from images and videos are
fundamental challenges with many applications. Recent advances in 2D human pose
estimation use large amounts of manually-labeled training data for learning
convolutional neural networks (CNNs). Such data is time-consuming to acquire
and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion
is impractical. In this work we present SURREAL (Synthetic hUmans foR REAL
tasks): a new large-scale dataset with synthetically-generated but realistic
images of people rendered from 3D sequences of human motion capture data. We
generate more than 6 million frames together with ground truth pose, depth
maps, and segmentation masks. We show that CNNs trained on our synthetic
dataset allow for accurate human depth estimation and human part segmentation
in real RGB images. Our results and the new dataset open up new possibilities
for advancing person analysis using cheap and large-scale synthetic data.
Comment: Appears in: 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2017). 9 page
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
Estimating the 6D pose of known objects is important for robots to interact
with the real world. The problem is challenging due to the variety of objects
as well as the complexity of a scene caused by clutter and occlusions between
objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network
for 6D object pose estimation. PoseCNN estimates the 3D translation of an
object by localizing its center in the image and predicting its distance from
the camera. The 3D rotation of the object is estimated by regressing to a
quaternion representation. We also introduce a novel loss function that enables
PoseCNN to handle symmetric objects. In addition, we contribute a large scale
video dataset for 6D object pose estimation named the YCB-Video dataset. Our
dataset provides accurate 6D poses of 21 objects from the YCB dataset observed
in 92 videos with 133,827 frames. We conduct extensive experiments on our
YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is
highly robust to occlusions, can handle symmetric objects, and provide accurate
pose estimation using only color images as input. When using depth data to
further refine the poses, our approach achieves state-of-the-art results on the
challenging OccludedLINEMOD dataset. Our code and dataset are available at
https://rse-lab.cs.washington.edu/projects/posecnn/.
Comment: Accepted to RSS 201
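The PoseCNN abstract describes recovering the 3D translation from the object's 2D center and its distance from the camera, and the 3D rotation from a quaternion. The sketch below shows one way these two steps could be computed, assuming a standard pinhole camera model. The function names, the intrinsics `fx`, `fy`, `px`, `py`, and the `(w, x, y, z)` quaternion ordering are assumptions for illustration, not taken from the PoseCNN code.

```python
import numpy as np

def backproject_center(cx, cy, tz, fx, fy, px, py):
    """Recover a 3D translation from a 2D center and a depth.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (px, py); (cx, cy) is the object center in pixels and tz is
    the predicted distance along the optical axis.
    """
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return np.array([tx, ty, tz])

def quaternion_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The input is normalized first, since a regressed quaternion is not
    guaranteed to have unit length.
    """
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])
```

For example, a center that coincides with the principal point back-projects to a translation lying on the optical axis, and the quaternion `(1, 0, 0, 0)` maps to the identity rotation.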
Rate-Accuracy Trade-Off In Video Classification With Deep Convolutional Neural Networks
Advanced video classification systems decode video frames to derive the
necessary texture and motion representations for ingestion and analysis by
spatio-temporal deep convolutional neural networks (CNNs). However, when
considering visual Internet-of-Things applications, surveillance systems and
semantic crawlers of large video repositories, the video capture and the
CNN-based semantic analysis parts do not tend to be co-located. This
necessitates the transport of compressed video over networks and incurs
significant overhead in bandwidth and energy consumption, thereby significantly
undermining the deployment potential of such systems. In this paper, we
investigate the trade-off between the encoding bitrate and the achievable
accuracy of CNN-based video classification models that directly ingest
AVC/H.264 and HEVC encoded videos. Instead of retaining entire compressed video
bitstreams and applying complex optical flow calculations prior to CNN
processing, we only retain motion vector and select texture information at
significantly-reduced bitrates and apply no additional processing prior to CNN
ingestion. Based on three CNN architectures and two action recognition
datasets, we achieve 11%-94% saving in bitrate with marginal effect on
classification accuracy. A model-based selection between multiple CNNs
increases these savings further, to the point where, if up to 7% loss of
accuracy can be tolerated, video classification can take place with as little
as 3 kbps for the transport of the required compressed video information to the
system implementing the CNN models.