34,939 research outputs found
Learning sparse representations of depth
This paper introduces a new method for learning and inferring sparse
representations of depth (disparity) maps. The proposed algorithm relaxes the
usual assumption of the stationary noise model in sparse coding. This enables
learning from data corrupted with spatially varying noise or uncertainty,
typically obtained by laser range scanners or structured light depth cameras.
Sparse representations are learned from the Middlebury database disparity maps
and then exploited in a two-layer graphical model for inferring depth from
stereo, by including a sparsity prior on the learned features. Since they
capture higher-order dependencies in the depth structure, these priors can
complement smoothness priors commonly used in depth inference based on Markov
Random Field (MRF) models. Inference on the proposed graph is achieved using an
alternating iterative optimization technique, where the first layer is solved
using an existing MRF-based stereo matching algorithm, then held fixed as the
second layer is solved using the proposed non-stationary sparse coding
algorithm. This leads to a general method for improving solutions of state of
the art MRF-based depth estimation algorithms. Our experimental results first
show that depth inference using learned representations leads to state of the
art denoising of depth maps obtained from laser range scanners and a time of
flight camera. Furthermore, we show that adding sparse priors improves the
results of two depth estimation methods: the classical graph cut algorithm by
Boykov et al. and the more recent algorithm of Woodford et al.Comment: 12 page
Steered mixture-of-experts for light field images and video : representation and coding
Research in light field (LF) processing has heavily increased over the last decade. This is largely driven by the desire to achieve the same level of immersion and navigational freedom for camera-captured scenes as it is currently available for CGI content. Standardization organizations such as MPEG and JPEG continue to follow conventional coding paradigms in which viewpoints are discretely represented on 2-D regular grids. These grids are then further decorrelated through hybrid DPCM/transform techniques. However, these 2-D regular grids are less suited for high-dimensional data, such as LFs. We propose a novel coding framework for higher-dimensional image modalities, called Steered Mixture-of-Experts (SMoE). Coherent areas in the higher-dimensional space are represented by single higher-dimensional entities, called kernels. These kernels hold spatially localized information about light rays at any angle arriving at a certain region. The global model consists thus of a set of kernels which define a continuous approximation of the underlying plenoptic function. We introduce the theory of SMoE and illustrate its application for 2-D images, 4-D LF images, and 5-D LF video. We also propose an efficient coding strategy to convert the model parameters into a bitstream. Even without provisions for high-frequency information, the proposed method performs comparable to the state of the art for low-to-mid range bitrates with respect to subjective visual quality of 4-D LF images. In case of 5-D LF video, we observe superior decorrelation and coding performance with coding gains of a factor of 4x in bitrate for the same quality. At least equally important is the fact that our method inherently has desired functionality for LF rendering which is lacking in other state-of-the-art techniques: (1) full zero-delay random access, (2) light-weight pixel-parallel view reconstruction, and (3) intrinsic view interpolation and super-resolution
Learning to Predict Image-based Rendering Artifacts with Respect to a Hidden Reference Image
Image metrics predict the perceived per-pixel difference between a reference
image and its degraded (e. g., re-rendered) version. In several important
applications, the reference image is not available and image metrics cannot be
applied. We devise a neural network architecture and training procedure that
allows predicting the MSE, SSIM or VGG16 image difference from the distorted
image alone while the reference is not observed. This is enabled by two
insights: The first is to inject sufficiently many un-distorted natural image
patches, which can be found in arbitrary amounts and are known to have no
perceivable difference to themselves. This avoids false positives. The second
is to balance the learning, where it is carefully made sure that all image
errors are equally likely, avoiding false negatives. Surprisingly, we observe,
that the resulting no-reference metric, subjectively, can even perform better
than the reference-based one, as it had to become robust against
mis-alignments. We evaluate the effectiveness of our approach in an image-based
rendering context, both quantitatively and qualitatively. Finally, we
demonstrate two applications which reduce light field capture time and provide
guidance for interactive depth adjustment.Comment: 13 pages, 11 figure
3D Face Reconstruction from Light Field Images: A Model-free Approach
Reconstructing 3D facial geometry from a single RGB image has recently
instigated wide research interest. However, it is still an ill-posed problem
and most methods rely on prior models hence undermining the accuracy of the
recovered 3D faces. In this paper, we exploit the Epipolar Plane Images (EPI)
obtained from light field cameras and learn CNN models that recover horizontal
and vertical 3D facial curves from the respective horizontal and vertical EPIs.
Our 3D face reconstruction network (FaceLFnet) comprises a densely connected
architecture to learn accurate 3D facial curves from low resolution EPIs. To
train the proposed FaceLFnets from scratch, we synthesize photo-realistic light
field images from 3D facial scans. The curve by curve 3D face estimation
approach allows the networks to learn from only 14K images of 80 identities,
which still comprises over 11 Million EPIs/curves. The estimated facial curves
are merged into a single pointcloud to which a surface is fitted to get the
final 3D face. Our method is model-free, requires only a few training samples
to learn FaceLFnet and can reconstruct 3D faces with high accuracy from single
light field images under varying poses, expressions and lighting conditions.
Comparison on the BU-3DFE and BU-4DFE datasets show that our method reduces
reconstruction errors by over 20% compared to recent state of the art
Optimized Data Representation for Interactive Multiview Navigation
In contrary to traditional media streaming services where a unique media
content is delivered to different users, interactive multiview navigation
applications enable users to choose their own viewpoints and freely navigate in
a 3-D scene. The interactivity brings new challenges in addition to the
classical rate-distortion trade-off, which considers only the compression
performance and viewing quality. On the one hand, interactivity necessitates
sufficient viewpoints for richer navigation; on the other hand, it requires to
provide low bandwidth and delay costs for smooth navigation during view
transitions. In this paper, we formally describe the novel trade-offs posed by
the navigation interactivity and classical rate-distortion criterion. Based on
an original formulation, we look for the optimal design of the data
representation by introducing novel rate and distortion models and practical
solving algorithms. Experiments show that the proposed data representation
method outperforms the baseline solution by providing lower resource
consumptions and higher visual quality in all navigation configurations, which
certainly confirms the potential of the proposed data representation in
practical interactive navigation systems
- …