1,792 research outputs found
Neural View-Interpolation for Sparse Light Field Video
We suggest representing light field (LF) videos as "one-off" neural networks (NN), i.e., a learned mapping from view-plus-time coordinates to high-resolution color values, trained on sparse views. Initially, this sounds like a bad idea for three main reasons: First, a NN LF will likely have less quality than a same-sized pixel basis representation. Second, only few training data, e.g., 9 exemplars per frame are available for sparse LF videos. Third, there is no generalization across LFs, but across view and time instead. Consequently, a network needs to be trained for each LF video. Surprisingly, these problems can turn into substantial advantages: Other than the linear pixel basis, a NN has to come up with a compact, non-linear i.e., more intelligent, explanation of color, conditioned on the sparse view and time coordinates. As observed for many NN however, this representation now is interpolatable: if the image output for sparse view coordinates is plausible, it is for all intermediate, continuous coordinates as well. Our specific network architecture involves a differentiable occlusion-aware warping step, which leads to a compact set of trainable parameters and consequently fast learning and fast execution
Unsupervised Learning of Depth and Ego-Motion from Video
We present an unsupervised learning framework for the task of monocular depth
and camera motion estimation from unstructured video sequences. We achieve this
by simultaneously training depth and camera pose estimation networks using the
task of view synthesis as the supervisory signal. The networks are thus coupled
via the view synthesis objective during training, but can be applied
independently at test time. Empirical evaluation on the KITTI dataset
demonstrates the effectiveness of our approach: 1) monocular depth performing
comparably with supervised methods that use either ground-truth pose or depth
for training, and 2) pose estimation performing favorably with established SLAM
systems under comparable input settings.Comment: Accepted to CVPR 2017. Project webpage:
https://people.eecs.berkeley.edu/~tinghuiz/projects/SfMLearner
Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers
Online Multi-Object Tracking (MOT) from videos is a challenging computer
vision task which has been extensively studied for decades. Most of the
existing MOT algorithms are based on the Tracking-by-Detection (TBD) paradigm
combined with popular machine learning approaches which largely reduce the
human effort to tune algorithm parameters. However, the commonly used
supervised learning approaches require the labeled data (e.g., bounding boxes),
which is expensive for videos. Also, the TBD framework is usually suboptimal
since it is not end-to-end, i.e., it considers the task as detection and
tracking, but not jointly. To achieve both label-free and end-to-end learning
of MOT, we propose a Tracking-by-Animation framework, where a differentiable
neural model first tracks objects from input frames and then animates these
objects into reconstructed frames. Learning is then driven by the
reconstruction error through backpropagation. We further propose a
Reprioritized Attentive Tracking to improve the robustness of data association.
Experiments conducted on both synthetic and real video datasets show the
potential of the proposed model. Our project page is publicly available at:
https://github.com/zhen-he/tracking-by-animationComment: CVPR 201
3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation
Regression-based methods for 3D human pose estimation directly predict the 3D
pose parameters from a 2D image using deep networks. While achieving
state-of-the-art performance on standard benchmarks, their performance degrades
under occlusion. In contrast, optimization-based methods fit a parametric body
model to 2D features in an iterative manner. The localized reconstruction loss
can potentially make them robust to occlusion, but they suffer from the 2D-3D
ambiguity.
Motivated by the recent success of generative models in rigid object pose
estimation, we propose 3D-aware Neural Body Fitting (3DNBF) - an approximate
analysis-by-synthesis approach to 3D human pose estimation with SOTA
performance and occlusion robustness. In particular, we propose a generative
model of deep features based on a volumetric human representation with Gaussian
ellipsoidal kernels emitting 3D pose-dependent feature vectors. The neural
features are trained with contrastive learning to become 3D-aware and hence to
overcome the 2D-3D ambiguity.
Experiments show that 3DNBF outperforms other approaches on both occluded and
standard benchmarks. Code is available at https://github.com/edz-o/3DNBFComment: ICCV 2023, project page: https://3dnbf.github.io
- …