Real-Time Seamless Single Shot 6D Object Pose Prediction
We propose a single-shot approach for simultaneously detecting an object in
an RGB image and predicting its 6D pose without requiring multiple stages or
having to examine multiple hypotheses. Unlike a recently proposed single-shot
technique for this task (Kehl et al., ICCV'17) that only predicts an
approximate 6D pose that must then be refined, ours is accurate enough not to
require additional post-processing. As a result, it is much faster - 50 fps on
a Titan X (Pascal) GPU - and more suitable for real-time processing. The key
component of our method is a new CNN architecture inspired by the YOLO network
design that directly predicts the 2D image locations of the projected vertices
of the object's 3D bounding box. The object's 6D pose is then estimated using a
PnP algorithm.
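The geometry behind this pipeline can be sketched in plain NumPy: given a pose, the eight corners of the object's 3D bounding box project to 2D through the pinhole camera model, and a PnP solver (e.g. OpenCV's `cv2.solvePnP`) inverts that mapping to recover the pose from the predicted 2D locations. A minimal sketch of the forward projection, with hypothetical intrinsics and pose:

```python
import numpy as np

def project_points(pts_3d, R, t, K):
    """Project Nx3 object-space points into the image via the pinhole model."""
    cam = pts_3d @ R.T + t            # object -> camera coordinates
    uv = cam @ K.T                    # apply camera intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide

# Eight corners of an axis-aligned 3D bounding box (a unit cube stands in
# for the object model here)
corners = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                   dtype=float)

K = np.array([[800., 0., 320.],       # hypothetical camera intrinsics
              [0., 800., 240.],
              [0., 0., 1.]])
R = np.eye(3)                         # identity rotation, for the sketch
t = np.array([0., 0., 5.])            # box placed 5 units in front of the camera

uv = project_points(corners, R, t, K)
print(uv.shape)  # (8, 2) -- the 2D corner locations the network regresses
```

A network trained as in the paper predicts `uv` directly from the image; PnP then solves for `R` and `t` from these 2D-3D correspondences.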
For single object and multiple object pose estimation on the LINEMOD and
OCCLUSION datasets, our approach substantially outperforms other recent
CNN-based approaches when they are all used without post-processing. During
post-processing, a pose refinement step can be used to boost the accuracy of
the existing methods, but at 10 fps or less, they are much slower than our
method.
Comment: CVPR 2018
Learning Separable Filters with Shared Parts
Learned image features can provide great accuracy in many Computer Vision tasks. However, when the convolution filters used to learn image features are numerous and not separable, feature extraction becomes computationally demanding and impractical to use in real-world situations. In this thesis work, a method for learning a small number of separable filters to approximate an arbitrary non-separable filter bank is developed. In this approach, separable filters are learned by grouping the arbitrary filters into a tensor and optimizing a tensor decomposition problem. The separable filter learning with tensor decomposition is general and can be applied to generic filter banks to reduce the computational burden of convolutions without a loss in performance. Moreover, the proposed approach is orders of magnitude faster than the approach of a recent study based on l1-norm minimization.
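The core idea that a 2D filter can be written as a sum of separable (rank-1) components can be illustrated with a plain SVD; this is a simplified stand-in for the thesis's tensor-decomposition formulation over a whole filter bank, not its implementation:

```python
import numpy as np

# A Laplacian-of-Gaussian filter: it looks non-separable, but is exactly the
# sum of two separable terms, g''(x)g(y) + g(x)g''(y).
x = np.arange(-4, 5, dtype=float)
g = np.exp(-x**2 / 2)
gxx = (x**2 - 1) * g
F = np.outer(gxx, g) + np.outer(g, gxx)       # 9x9 filter

# The SVD recovers separable components: F = sum_k s_k * u_k v_k^T
U, s, Vt = np.linalg.svd(F)

def separable_approx(rank):
    """Approximate F as a sum of `rank` separable filters (outer products)."""
    return sum(s[k] * np.outer(U[:, k], Vt[k]) for k in range(rank))

rel_err = np.linalg.norm(F - separable_approx(2)) / np.linalg.norm(F)
# Two separable filters reproduce F to float precision, so the 9x9
# convolution collapses into a few cheap 1D convolutions.
```

Each rank-1 term is an outer product of two 1D vectors, so convolving with it costs two 1D passes instead of one full 2D pass, which is where the computational savings come from.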
Learning Robust Features and Latent Representations for Single View 3D Pose Estimation of Humans and Objects
Estimating the 3D poses of rigid and articulated bodies is one of the fundamental problems of Computer Vision. It has a broad range of applications including augmented reality, surveillance, animation and human-computer interaction. Despite the ever-growing demand driven by the applications, predicting 3D pose from a 2D image is a challenging and ill-posed problem due to the loss of depth information during projection from 3D to 2D. Although there have been years of research on 3D pose estimation problem, it still remains unsolved. In this thesis, we propose a variety of ways to tackle the 3D pose estimation problem both for articulated human bodies and rigid object bodies by learning robust features and latent representations.
First, we present a novel video-based approach that exploits spatiotemporal features for 3D human pose estimation in a discriminative regression scheme. While early approaches typically account for motion information by temporally regularizing noisy pose estimates in individual frames, we demonstrate that taking into account motion information very early in the modeling process with spatiotemporal features yields significant performance improvements. We further propose a CNN-based motion compensation approach that stabilizes and centralizes the human body in the bounding boxes of consecutive frames to increase the reliability of spatiotemporal features. This then allows us to effectively overcome ambiguities and improve pose estimation accuracy.
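A toy version of the recentering idea: shift each frame's crop so that the subject's center stays fixed, then stack the crops into a spatiotemporal volume. Here the subject centers are given; in the thesis a CNN predicts the compensating shift:

```python
import numpy as np

def centered_crop(frame, center, size):
    """Crop a size x size window around `center`, zero-padding at the borders."""
    pad = size // 2
    padded = np.pad(frame, pad)
    cy, cx = center
    return padded[cy:cy + size, cx:cx + size]

# Toy sequence: a bright 'subject' pixel drifting across otherwise empty frames
frames, centers = [], []
for t in range(5):
    f = np.zeros((32, 32))
    cy, cx = 10 + 2 * t, 8 + 3 * t        # subject moves between frames
    f[cy, cx] = 1.0
    frames.append(f)
    centers.append((cy, cx))              # known here; predicted by a CNN in the thesis

# Motion-compensated spatiotemporal volume: the subject is centered in every slice
volume = np.stack([centered_crop(f, c, 11) for f, c in zip(frames, centers)])
```

With the subject stabilized at the center of every slice, spatiotemporal features computed over `volume` describe body motion rather than camera or subject translation.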
Second, we develop a novel Deep Learning framework for structured prediction of 3D human pose. Our approach relies on an auto-encoder to learn a high-dimensional latent pose representation that accounts for joint dependencies. We combine traditional CNNs for supervised learning with auto-encoders for structured learning and demonstrate that our approach outperforms the existing ones both in terms of structure preservation and prediction accuracy.
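A linear stand-in for the auto-encoder idea: PCA is the optimal linear auto-encoder, so if poses lie near a low-dimensional manifold (reflecting joint dependencies), a compact latent code can reconstruct them. The deep model in the thesis replaces the linear maps below with nonlinear networks; the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 'poses': 51-D vectors (17 joints x 3 coords) that actually live on a
# 5-D latent subspace -- a linear stand-in for joint dependencies.
latent_true = rng.standard_normal((200, 5))
mixing = rng.standard_normal((5, 51))
poses = latent_true @ mixing

# Linear auto-encoder via PCA: encoder = top-k principal directions,
# decoder = their transpose.
mean = poses.mean(axis=0)
U, s, Vt = np.linalg.svd(poses - mean, full_matrices=False)
encoder = Vt[:5].T                      # 51 -> 5 latent code
decoder = Vt[:5]                        # 5 -> 51 reconstruction

z = (poses - mean) @ encoder            # latent pose representation
recon = z @ decoder + mean
rel_err = np.linalg.norm(recon - poses) / np.linalg.norm(poses)
# Five latent dimensions suffice to reproduce the 51-D poses
```

Regressing the CNN output into such a latent space, rather than directly into joint coordinates, is what lets the predictions respect the dependency structure among joints.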
Third, we propose a 3D human pose estimation approach that relies on a two-stream neural network architecture to simultaneously exploit 2D joint location heatmaps and image features. We show that 2D pose of a person, predicted in terms of heatmaps by a fully convolutional network, provides valuable cues to disambiguate challenging poses and results in increased pose estimation accuracy. We further introduce a novel and generic trainable fusion scheme, which automatically learns where and how to fuse the features extracted from two different input modalities that a two-stream neural network operates on. Our trainable fusion framework selects the optimal network architecture on-the-fly and improves upon standard hard-coded network architectures.
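One concrete way a 2D heatmap becomes a joint location is a soft-argmax: a softmax over the map followed by an expectation over pixel coordinates. This is a generic, differentiable decoding sketch, not necessarily the exact one used in the thesis:

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Differentiable joint location: softmax over the map, then the expectation."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())   # stable softmax
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * xs).sum()), float((p * ys).sum())   # (x, y)

# A synthetic heatmap peaked at (x=20, y=12): the log of a Gaussian bump
ys, xs = np.mgrid[0:64, 0:64]
hm = -((xs - 20.0) ** 2 + (ys - 12.0) ** 2) / 10.0

x, y = soft_argmax_2d(hm)   # recovers approximately (20, 12)
```

Because the decoding is differentiable, heatmap prediction and downstream 3D regression can be trained jointly end to end.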
Fourth, we propose an efficient approach to estimate the 3D pose of objects from a single RGB image. Existing methods typically detect 2D bounding boxes and then predict the object pose using a pipelined approach. The redundancy in different parts of the architecture makes such methods computationally expensive. Moreover, the final pose estimation accuracy depends on the accuracy of the intermediate 2D object detection step. In our method, the object is classified and its pose is regressed in a single shot from the full image using a single, compact fully convolutional neural network. Our approach achieves state-of-the-art accuracy without requiring any costly pose refinement step and runs in real-time at 50 fps on a modern GPU, which is at least 5X faster than the state of the art.
Separable Filter Learning with Tensor Decomposition
Learned image features can provide great accuracy in many Computer Vision tasks. However, when the convolution filters used to learn image features are numerous and not separable, feature extraction becomes computationally demanding and impractical to use in real-world situations. In this thesis work, a method for learning a small number of separable filters to approximate an arbitrary non-separable filter bank is developed. In this approach, separable filters are learned by grouping the arbitrary filters into a tensor and optimizing a tensor decomposition problem. The separable filter learning with tensor decomposition is general and can be applied to generic filter banks to reduce the computational burden of convolutions without a loss in performance. Moreover, the proposed approach is orders of magnitude faster than the approach of a very recent paper based on L1-norm minimization.
Benefits of consistency in image denoising with steerable wavelets
The steerable wavelet transform is a redundant image representation with the remarkable property that its basis functions can be adaptively rotated to a desired orientation. This makes the transform well-suited to the design of wavelet-based algorithms applicable to images with a high amount of directional features. However, because the representation is redundant, not every set of coefficients is a legitimate analysis of an image, so arbitrary modification of the wavelet-domain coefficients may violate consistency constraints. In this paper, by honoring the redundancy of the coefficients, we demonstrate that it is possible to improve the performance of regularized least-squares problems in the steerable wavelet domain. We illustrate that our consistent method significantly improves upon the performance of conventional denoising with steerable wavelets.
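The consistency idea can be sketched for any redundant linear transform: after an arbitrary coefficient modification (such as thresholding), the coefficients are generally no longer the analysis of any image, but projecting them back onto the range of the analysis operator restores consistency. A tiny stand-in with a random redundant operator in place of the steerable wavelet transform:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 40                        # 16-sample 'image', 40 redundant coefficients
A = rng.standard_normal((m, n))      # stand-in for a redundant analysis operator

x = rng.standard_normal(n)
c = A @ x                            # legitimate (consistent) coefficients

# An arbitrary wavelet-domain modification (here, hard thresholding)
# generally breaks consistency:
c_mod = np.where(np.abs(c) > 2.0, c, 0.0)

# Least-squares projection back onto the range of A yields the closest
# consistent coefficient vector:
x_hat = np.linalg.lstsq(A, c_mod, rcond=None)[0]
c_consistent = A @ x_hat

residual = np.linalg.norm(c_mod - c_consistent)   # nonzero: c_mod was inconsistent
```

The paper's contribution is to exploit exactly this kind of constraint inside regularized least-squares denoising, rather than treating the redundant coefficients as free variables.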
Direct Prediction of 3D Body Poses from Motion Compensated Sequences
We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Previous approaches typically compute candidate poses in individual frames and then link them in a post-processing step to resolve ambiguities. By contrast, we directly regress from a spatio-temporal volume of bounding boxes to a 3D pose in the central frame. We further show that, for this approach to achieve its full potential, it is essential to compensate for the motion in consecutive frames so that the subject remains centered. This then allows us to effectively overcome ambiguities and improve upon the state of the art by a large margin on the Human3.6M, HumanEva, and KTH Multiview Football 3D human pose estimation benchmarks.
FoundPose: Unseen Object Pose Estimation with Foundation Features
We propose FoundPose, a method for 6D pose estimation of unseen rigid objects
from a single RGB image. The method assumes that 3D models of the objects are
available but does not require any object-specific training. This is achieved
by building upon DINOv2, a recent vision foundation model with impressive
generalization capabilities. An online pose estimation stage is supported by a
minimal object representation that is built during a short onboarding stage
from DINOv2 patch features extracted from rendered object templates. Given a
query image with an object segmentation mask, FoundPose first rapidly retrieves
a handful of similar-looking templates with a DINOv2-based bag-of-words
approach. Pose hypotheses are then generated from 2D-3D correspondences
established by matching DINOv2 patch features between the query image and a
retrieved template, and finally optimized by featuremetric refinement. The
method can handle diverse objects, including challenging ones with symmetries
and without any texture, and noticeably outperforms existing RGB methods for
coarse pose estimation in both accuracy and speed on the standard BOP
benchmark. With the featuremetric and the additional MegaPose refinement, which are
shown to be complementary, the method outperforms all RGB competitors. Source
code is at: evinpinar.github.io/foundpose
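The retrieval step can be sketched generically: quantize patch descriptors against a visual vocabulary, build normalized bag-of-words histograms, and rank templates by the similarity of their histograms to the query's. Here random vectors stand in for DINOv2 patch features and the vocabulary; only the mechanism is illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)

def bow_histogram(patch_feats, vocab):
    """Assign each patch descriptor to its nearest visual word; return a
    normalized word-count histogram."""
    d2 = ((patch_feats[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-9)

vocab = rng.standard_normal((32, 64))                 # 32 visual words, 64-D features

# Patch features of 10 rendered templates; the query is template 7 plus noise,
# mimicking the template that matches the observed object view
templates = [rng.standard_normal((50, 64)) for _ in range(10)]
query = templates[7] + 0.05 * rng.standard_normal((50, 64))

q = bow_histogram(query, vocab)
scores = np.array([q @ bow_histogram(t, vocab) for t in templates])
best = int(scores.argmax())   # expected to recover template 7
```

Because histograms discard patch positions, this comparison is cheap and robust; the retrieved template then supplies the approximate viewpoint from which the 2D-3D correspondences are established.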
Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation
Most recent approaches to monocular 3D human pose estimation rely on Deep
Learning. They typically involve regressing from an image to either 3D joint
coordinates directly or 2D joint locations from which 3D coordinates are
inferred. Both approaches have their strengths and weaknesses and we therefore
propose a novel architecture designed to deliver the best of both worlds by
performing both simultaneously and fusing the information along the way. At the
heart of our framework is a trainable fusion scheme that learns how to fuse the
information optimally instead of being hand-designed. This yields significant
improvements upon the state of the art on standard 3D human pose estimation
benchmarks.
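The spirit of a trainable fusion scheme, reduced to a single scalar: learn by gradient descent how much to weight each stream, instead of hard-coding the mix. The two 'streams' below are synthetic stand-ins for the 2D-cue and image-feature branches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two streams predicting the same target; stream A is far more reliable here
target = rng.standard_normal(500)
stream_a = target + 0.1 * rng.standard_normal(500)   # low-noise stream
stream_b = target + 1.0 * rng.standard_normal(500)   # high-noise stream

w = 0.5                      # fusion weight, initialized to an even, hand-coded mix
lr = 0.3
for _ in range(200):
    fused = w * stream_a + (1 - w) * stream_b
    # Gradient of the mean squared error with respect to w
    grad = 2 * np.mean((fused - target) * (stream_a - stream_b))
    w -= lr * grad

# The learned weight shifts strongly toward the reliable stream
fused = w * stream_a + (1 - w) * stream_b
```

In the actual framework the learned quantities are where and how feature maps are fused inside a two-stream network, but the principle is the same: the data, not the designer, sets the mixing.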