33 research outputs found

    Real-Time Seamless Single Shot 6D Object Pose Prediction

    We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task (Kehl et al., ICCV'17) that only predicts an approximate 6D pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster, running at 50 fps on a Titan X (Pascal) GPU, and more suitable for real-time processing. The key component of our method is a new CNN architecture inspired by the YOLO network design that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm. For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent CNN-based approaches when they are all used without post-processing. During post-processing, a pose refinement step can be used to boost the accuracy of the existing methods, but at 10 fps or less, they are much slower than our method. Comment: CVPR 201
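    The geometry behind the predicted quantities can be illustrated with a minimal sketch. It is not the authors' network or their PnP solver: it only shows the forward pinhole projection that maps the eight corners of an object's 3D bounding box to the 2D image locations the abstract describes, which a PnP algorithm would then invert to recover the pose. The function names, the box size, and the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are assumptions chosen for illustration.

```python
from itertools import product


def bounding_box_corners(half_extents):
    """Return the 8 corners of an axis-aligned box centred at the object origin."""
    hx, hy, hz = half_extents
    return [(sx * hx, sy * hy, sz * hz)
            for sx, sy, sz in product((-1.0, 1.0), repeat=3)]


def project_points(points, rotation, translation, fx, fy, cx, cy):
    """Project 3D object points into the image with a pinhole camera model."""
    projected = []
    for p in points:
        # Camera-frame coordinates: X_c = R * p + t
        xc, yc, zc = (
            sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
            for r in range(3)
        )
        # Perspective division followed by the intrinsic parameters
        projected.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return projected


# Hypothetical pose: no rotation, object placed 1 m in front of the camera.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
corners = bounding_box_corners((0.1, 0.1, 0.1))
pixels = project_points(corners, IDENTITY, (0.0, 0.0, 1.0),
                        fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

    Given at least four such 2D-3D correspondences and known intrinsics, a PnP solver estimates the rotation and translation that best reproduce the observed 2D points.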

    Learning Separable Filters with Shared Parts

    Learned image features can provide great accuracy in many Computer Vision tasks. However, when the convolution filters used to learn image features are numerous and not separable, feature extraction becomes computationally demanding and impractical to use in real-world situations. In this thesis work, a method for learning a small number of separable filters to approximate an arbitrary non-separable filter bank is developed. In this approach, separable filters are learned by grouping the arbitrary filters into a tensor and optimizing a tensor decomposition problem. The separable filter learning with tensor decomposition is general and can be applied to generic filter banks to reduce the computational burden of convolutions without a loss in performance. Moreover, the proposed approach is orders of magnitude faster than a recent approach based on l1-norm minimization.

    Learning Robust Features and Latent Representations for Single View 3D Pose Estimation of Humans and Objects

    Estimating the 3D poses of rigid and articulated bodies is one of the fundamental problems of Computer Vision. It has a broad range of applications including augmented reality, surveillance, animation and human-computer interaction. Despite the ever-growing demand driven by these applications, predicting 3D pose from a 2D image is a challenging and ill-posed problem due to the loss of depth information during projection from 3D to 2D. Although there have been years of research on the 3D pose estimation problem, it remains unsolved. In this thesis, we propose a variety of ways to tackle the 3D pose estimation problem both for articulated human bodies and rigid object bodies by learning robust features and latent representations. First, we present a novel video-based approach that exploits spatiotemporal features for 3D human pose estimation in a discriminative regression scheme. While early approaches typically account for motion information by temporally regularizing noisy pose estimates in individual frames, we demonstrate that taking into account motion information very early in the modeling process with spatiotemporal features yields significant performance improvements. We further propose a CNN-based motion compensation approach that stabilizes and centralizes the human body in the bounding boxes of consecutive frames to increase the reliability of spatiotemporal features. This then allows us to effectively overcome ambiguities and improve pose estimation accuracy. Second, we develop a novel Deep Learning framework for structured prediction of 3D human pose. Our approach relies on an auto-encoder to learn a high-dimensional latent pose representation that accounts for joint dependencies. We combine traditional CNNs for supervised learning with auto-encoders for structured learning and demonstrate that our approach outperforms the existing ones both in terms of structure preservation and prediction accuracy. Third, we propose a 3D human pose estimation approach that relies on a two-stream neural network architecture to simultaneously exploit 2D joint location heatmaps and image features. We show that the 2D pose of a person, predicted in terms of heatmaps by a fully convolutional network, provides valuable cues to disambiguate challenging poses and results in increased pose estimation accuracy. We further introduce a novel and generic trainable fusion scheme, which automatically learns where and how to fuse the features extracted from the two different input modalities that a two-stream neural network operates on. Our trainable fusion framework selects the optimal network architecture on-the-fly and improves upon standard hard-coded network architectures. Fourth, we propose an efficient approach to estimate the 3D pose of objects from a single RGB image. Existing methods typically detect 2D bounding boxes and then predict the object pose using a pipelined approach. The redundancy in different parts of the architecture makes such methods computationally expensive. Moreover, the final pose estimation accuracy depends on the accuracy of the intermediate 2D object detection step. In our method, the object is classified and its pose is regressed in a single shot from the full image using a single, compact fully convolutional neural network. Our approach achieves state-of-the-art accuracy without requiring any costly pose refinement step and runs in real-time at 50 fps on a modern GPU, which is at least 5X faster than the state of the art.

    Separable Filter Learning with Tensor Decomposition

    Learned image features can provide great accuracy in many Computer Vision tasks. However, when the convolution filters used to learn image features are numerous and not separable, feature extraction becomes computationally demanding and impractical to use in real-world situations. In this thesis work, a method for learning a small number of separable filters to approximate an arbitrary non-separable filter bank is developed. In this approach, separable filters are learned by grouping the arbitrary filters into a tensor and optimizing a tensor decomposition problem. The separable filter learning with tensor decomposition is general and can be applied to generic filter banks to reduce the computational burden of convolutions without a loss in performance. Moreover, the proposed approach is orders of magnitude faster than the approach of a very recent paper based on L1-norm minimization.
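    Why separability reduces the cost of convolution can be shown with a small sketch. It does not implement the tensor-decomposition learning described above; it only demonstrates the property that motivates it: a rank-1 (separable) 2D filter, written as the outer product of a column and a row vector, gives the same response as one horizontal 1D pass followed by one vertical 1D pass. The filter values, the toy image, and the function names are assumptions chosen for illustration.

```python
def correlate2d_valid(image, kernel):
    """Direct 2D correlation over the 'valid' region (kh * kw multiplies per pixel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def correlate_separable(image, col, row):
    """Same result for kernel = outer(col, row), using kh + kw multiplies per pixel."""
    kh, kw = len(col), len(row)
    # Horizontal pass with the row vector
    horiz = [[sum(line[j + b] * row[b] for b in range(kw))
              for j in range(len(line) - kw + 1)]
             for line in image]
    # Vertical pass with the column vector
    return [[sum(horiz[i + a][j] * col[a] for a in range(kh))
             for j in range(len(horiz[0]))]
            for i in range(len(horiz) - kh + 1)]


col = [1.0, 2.0, 1.0]
row = [-1.0, 0.0, 1.0]
kernel = [[c * r for r in row] for c in col]  # rank-1, Sobel-like filter
image = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(5)]

full = correlate2d_valid(image, kernel)
fast = correlate_separable(image, col, row)
```

    For a k-by-k filter, the direct form costs k squared multiplications per output pixel and the separable form costs 2k, which is the saving that approximating a non-separable bank with a few separable filters aims to exploit.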

    Benefits of consistency in image denoising with steerable wavelets

    The steerable wavelet transform is a redundant image representation with the remarkable property that its basis functions can be adaptively rotated to a desired orientation. This makes the transform well-suited to the design of wavelet-based algorithms applicable to images with a high amount of directional features. However, arbitrary modification of the wavelet-domain coefficients may violate consistency constraints because a legitimate representation must be redundant. In this paper, by honoring the redundancy of the coefficients, we demonstrate that it is possible to improve the performance of regularized least-squares problems in the steerable wavelet domain. We illustrate that our consistent method significantly improves upon the performance of conventional denoising with steerable wavelets
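    The notion of consistency in a redundant representation can be sketched on a toy example. This is not the steerable wavelet transform or the authors' regularized least-squares method: it uses a three-vector Parseval frame in the plane (an assumption chosen because it is the smallest redundant representation) to show that thresholded coefficients generally stop being the transform of any signal, and that applying synthesis followed by analysis projects them back onto the set of consistent coefficients.

```python
import math

# Three unit directions 120 degrees apart, scaled so that synthesis(analysis(x)) == x.
ANGLES_DEG = (90.0, 210.0, 330.0)
SCALE = math.sqrt(2.0 / 3.0)
FRAME = [(SCALE * math.cos(math.radians(a)), SCALE * math.sin(math.radians(a)))
         for a in ANGLES_DEG]


def analysis(x):
    """Redundant transform: 2 signal values -> 3 coefficients."""
    return [e[0] * x[0] + e[1] * x[1] for e in FRAME]


def synthesis(c):
    """Adjoint of the analysis operator: 3 coefficients -> 2 signal values."""
    return (sum(ck * e[0] for ck, e in zip(c, FRAME)),
            sum(ck * e[1] for ck, e in zip(c, FRAME)))


def project_to_consistent(c):
    """Orthogonal projection onto the range of the analysis operator."""
    return analysis(synthesis(c))


signal = (1.0, 0.5)
coeffs = analysis(signal)
# Arbitrary modification (hard threshold) that ignores the redundancy
modified = [c if abs(c) > 0.5 else 0.0 for c in coeffs]
consistent = project_to_consistent(modified)
```

    Here `modified` is not the transform of any signal, whereas `consistent` is unchanged by a further synthesis and analysis round trip.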

    Direct Prediction of 3D Body Poses from Motion Compensated Sequences

    We propose an efficient approach to exploiting motion information from consecutive frames of a video sequence to recover the 3D pose of people. Previous approaches typically compute candidate poses in individual frames and then link them in a post-processing step to resolve ambiguities. By contrast, we directly regress from a spatio-temporal volume of bounding boxes to a 3D pose in the central frame. We further show that, for this approach to achieve its full potential, it is essential to compensate for the motion in consecutive frames so that the subject remains centered. This then allows us to effectively overcome ambiguities and improve upon the state-of-the-art by a large margin on the Human3.6m, HumanEva, and KTH Multiview Football 3D human pose estimation benchmarks

    FoundPose: Unseen Object Pose Estimation with Foundation Features

    We propose FoundPose, a method for 6D pose estimation of unseen rigid objects from a single RGB image. The method assumes that 3D models of the objects are available but does not require any object-specific training. This is achieved by building upon DINOv2, a recent vision foundation model with impressive generalization capabilities. An online pose estimation stage is supported by a minimal object representation that is built during a short onboarding stage from DINOv2 patch features extracted from rendered object templates. Given a query image with an object segmentation mask, FoundPose first rapidly retrieves a handful of similar-looking templates by a DINOv2-based bag-of-words approach. Pose hypotheses are then generated from 2D-3D correspondences established by matching DINOv2 patch features between the query image and a retrieved template, and finally optimized by featuremetric refinement. The method can handle diverse objects, including challenging ones with symmetries and without any texture, and noticeably outperforms existing RGB methods for coarse pose estimation in both accuracy and speed on the standard BOP benchmark. With the featuremetric refinement and additional MegaPose refinement, which are shown to be complementary, the method outperforms all RGB competitors. Source code is at: evinpinar.github.io/foundpose
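    The template retrieval step can be illustrated with a generic bag-of-words sketch. It is not the FoundPose implementation: the descriptors are toy 2D points rather than DINOv2 patch features, the three-word vocabulary is hand-picked rather than learned, and the cosine-similarity ranking and all function names are assumptions chosen for illustration.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to a descriptor (squared Euclidean distance)."""
    return min(range(len(vocabulary)),
               key=lambda w: sum((d - v) ** 2
                                 for d, v in zip(descriptor, vocabulary[w])))


def bow_histogram(descriptors, vocabulary):
    """L2-normalised histogram of visual-word occurrences."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist


def retrieve(query_hist, template_hists, top_k):
    """Rank templates by cosine similarity; ties are broken by template index."""
    scores = [(sum(q * t for q, t in zip(query_hist, h)), i)
              for i, h in enumerate(template_hists)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:top_k]]


# Hypothetical vocabulary and patch descriptors
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
templates = [
    [(0.1, 0.0), (0.0, 0.1), (0.9, 0.1)],
    [(1.1, 0.0), (0.9, -0.1), (0.1, 0.9)],
    [(0.0, 1.1), (0.1, 0.9), (0.0, 0.05)],
]
query = [(1.0, 0.1), (0.95, 0.0), (0.0, 1.0)]

template_hists = [bow_histogram(t, vocabulary) for t in templates]
ranked = retrieve(bow_histogram(query, vocabulary), template_hists, top_k=2)
```

    Comparing fixed-length histograms instead of matching every patch against every template is what keeps this kind of retrieval fast; dense patch matching is then needed only for the few templates that are returned.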

    Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation

    Most recent approaches to monocular 3D human pose estimation rely on Deep Learning. They typically involve regressing from an image to either 3D joint coordinates directly or 2D joint locations from which 3D coordinates are inferred. Both approaches have their strengths and weaknesses and we therefore propose a novel architecture designed to deliver the best of both worlds by performing both simultaneously and fusing the information along the way. At the heart of our framework is a trainable fusion scheme that learns how to fuse the information optimally instead of being hand-designed. This yields significant improvements upon the state-of-the-art on standard 3D human pose estimation benchmarks