1,370 research outputs found
Recovering 6D Object Pose: A Review and Multi-modal Analysis
A large number of studies analyse object detection and pose estimation at
the visual level in 2D, discussing the effects of challenges such as occlusion,
clutter, and texture on the performance of methods that work in the
context of the RGB modality. Interpreting the depth data, the study in this paper
presents thorough multi-modal analyses. It discusses the above-mentioned
challenges for full 6D object pose estimation in RGB-D images comparing the
performances of several 6D detectors in order to answer the following
questions: What is the current position of the computer vision community for
maintaining "automation" in robotic manipulation? What next steps should the
community take for improving "autonomy" in robotics while handling objects? Our
findings include: (i) Reasonably accurate results are obtained on
textured objects at varying viewpoints with cluttered backgrounds. (ii) Heavy
occlusion and clutter severely affect the detectors, and
similar-looking distractors are the biggest challenge in recovering instances'
6D pose. (iii) Template-based methods and random forest-based learning algorithms
underlie object detection and 6D pose estimation. The recent paradigm is to learn
deep discriminative feature representations and to adopt CNNs taking RGB images
as input. (iv) Depending on the availability of large-scale 6D annotated depth
datasets, feature representations can be learnt on these datasets, and then the
learnt representations can be customized for the 6D problem.
GASP : Geometric Association with Surface Patches
A fundamental challenge to sensory processing tasks in perception and
robotics is the problem of obtaining data associations across views. We present
a robust solution for ascertaining potentially dense surface patch (superpixel)
associations, requiring just range information. Our approach involves
decomposition of a view into regularized surface patches. We represent them as
sequences expressing geometry invariantly over their superpixel neighborhoods,
as uniquely consistent partial orderings. We match these representations
through an optimal sequence comparison metric based on the Damerau-Levenshtein
distance - enabling robust association with quadratic complexity (in contrast
to hitherto employed joint matching formulations which are NP-complete). The
approach is able to perform under wide baselines, heavy rotations, partial
overlaps, significant occlusions and sensor noise.
The technique does not require any priors -- motion or otherwise, and does
not make restrictive assumptions on scene structure and sensor movement. It
does not require appearance information, and is hence more widely applicable than
appearance-reliant methods and invulnerable to related ambiguities such as
textureless or aliased content. We present promising qualitative and
quantitative results under diverse settings, along with comparisons with
popular approaches based on range as well as RGB-D data.
Comment: International Conference on 3D Vision, 201
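The GASP abstract above matches patch sequences with a metric based on the Damerau-Levenshtein distance and notes its quadratic complexity. As a rough illustration only, the sketch below computes the restricted variant of that distance (optimal string alignment) between two sequences of hashable labels. The function name `osa_distance` and the use of plain label sequences are assumptions made for this example; the abstract does not specify the paper's actual metric or patch encoding.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent items. Runs in O(len(a) * len(b)) time, i.e. quadratic.
    """
    n, m = len(a), len(b)
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i items
    for j in range(m + 1):
        d[0][j] = j  # insert all j items
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent items.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]
```

For example, `osa_distance([1, 2, 3], [1, 3, 2])` counts a single adjacent swap, whereas plain Levenshtein distance would count two substitutions.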
Robust Wide-Baseline Stereo Matching for Sparsely Textured Scenes
The task of wide baseline stereo matching algorithms is to identify corresponding elements in pairs of overlapping images taken from significantly different viewpoints. Such algorithms are a key ingredient to many computer vision applications, including object recognition, automatic camera orientation, 3D reconstruction and image registration. Although today's methods for wide baseline stereo matching produce reliable results for typical application scenarios, they assume properties of the image data that are not always given, for example a significant amount of distinctive surface texture. For such problems, highly advanced algorithms have been proposed, which are often very problem specific, difficult to implement and hard to transfer to new matching problems. The motivation for our work comes from the belief that we can find a generic formulation for robust wide baseline image matching that is able to solve difficult matching problems and is at the same time applicable to a variety of applications. It should be easy to implement and have good semantic interpretability. Therefore, our key contribution is the development of a generic statistical model for wide baseline stereo matching, which seamlessly integrates different types of image features, similarity measures and spatial feature relationships as information cues. It unifies the ideas of existing approaches into a Bayesian formulation, which has a clear statistical interpretation as the MAP estimate of a binary classification problem. The model ultimately takes the form of a global minimization problem that can be solved with standard optimization techniques. The particular type of features, measures, and spatial relationships, however, is not prescribed. A major advantage of our model over existing approaches is its ability to compensate for weaknesses in one information cue implicitly by exploiting the strengths of others.
In our experiments we concentrate on images of sparsely textured scenes as a specifically difficult matching problem. Here the number of stable image features is typically rather small, and the distinctiveness of feature descriptions is often low. We use the proposed framework to implement a wide baseline stereo matching algorithm that can deal better with poor texture than established methods. To demonstrate the practical relevance, we also apply this algorithm to a system for automatic image orientation. Here, the task is to reconstruct the relative 3D positions and orientations of the cameras corresponding to a set of overlapping images. We show that our implementation leads to more successful results in the case of sparsely textured scenes, while still retaining state-of-the-art performance on standard datasets.
German title: Robuste Merkmalszuordnung für Bildpaare schwach texturierter Szenen mit deutlicher Stereobasis
Local, Semi-Local and Global Models for Texture, Object and Scene Recognition
This dissertation addresses the problems of recognizing textures, objects, and scenes in photographs. We present approaches to these recognition tasks that combine salient local image features with spatial relations and effective discriminative learning techniques. First, we introduce a bag of features image model for recognizing textured surfaces under a wide range of transformations, including viewpoint changes and non-rigid deformations. We present results of a large-scale comparative evaluation indicating that bags of features can be effective not only for texture, but also for object categorization, even in the presence of substantial clutter and intra-class variation. We also show how to augment the purely local image representation with statistical co-occurrence relations between pairs of nearby features, and develop a learning and classification framework for the task of classifying individual features in a multi-texture image. Next, we present a more structured alternative to bags of features for object recognition, namely, an image representation based on semi-local parts, or groups of features characterized by stable appearance and geometric layout. Semi-local parts are automatically learned from small sets of unsegmented, cluttered images. Finally, we present a global method for recognizing scene categories that works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting spatial pyramid representation demonstrates significantly improved performance on challenging scene categorization tasks.
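The spatial pyramid idea described at the end of the abstract above (partition the image into increasingly fine sub-regions and histogram the local features in each) can be sketched as follows. This is a minimal illustration under several assumptions that are not in the abstract: feature positions are normalized to [0, 1], each feature has already been quantized to a visual-word index, the grid doubles in resolution at each level, and the per-level histograms are concatenated without weighting. The name `spatial_pyramid` and its parameters are hypothetical.

```python
import numpy as np

def spatial_pyramid(xy, words, vocab_size, levels=2):
    """Concatenate per-cell visual-word histograms over a grid pyramid.

    xy         : (N, 2) feature positions, normalized to [0, 1] (assumed)
    words      : (N,) visual-word index of each feature, in [0, vocab_size)
    vocab_size : number of visual words
    levels     : finest level; level l uses a 2**l by 2**l grid
    """
    xy = np.asarray(xy, dtype=float)
    words = np.asarray(words, dtype=int)
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Grid cell of each feature at this level; clamp so 1.0 stays inside.
        cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
        # One histogram bin per (cell, visual word) pair.
        bin_index = (cy * cells + cx) * vocab_size + words
        hist = np.bincount(bin_index, minlength=cells * cells * vocab_size)
        parts.append(hist)
    return np.concatenate(parts)
```

With `levels=2` the descriptor has `vocab_size * (1 + 4 + 16)` entries. The first `vocab_size` entries are the ordinary bag-of-features histogram for the whole image.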
Leveraging 3D City Models for Rotation Invariant Place-of-Interest Recognition
Given a cell phone image of a building, we address the problem of place-of-interest recognition in urban scenarios. Here, we go beyond what has been shown in earlier approaches by exploiting 3D building information that is now often available (e.g. from extruded floor plans) and massive street-level image data for database creation. Exploiting vanishing points in query images, and thus fully removing 3D rotation from the recognition problem, then allows us to reduce the feature invariance to a purely homothetic problem, which we show enables more discriminative power in feature descriptors than classical SIFT. We rerank visual-word-based document queries using a fast stratified homothetic verification that in most cases boosts the correct document to top positions if it was in the short list. Since we exploit 3D building information, the approach finally outputs the camera pose in real-world coordinates, ready for augmenting the cell phone image with virtual 3D information. The whole system is demonstrated to outperform traditional approaches on city-scale experiments for different sources of street-level image data and a challenging set of cell phone images.
A Robust RGBD Slam System for 3D Environment with Planar Surfaces
With the increasing popularity of RGB-depth (RGB-D) sensors such as the Microsoft Kinect, there has been much research on capturing and reconstructing 3D environments using a movable RGB-D sensor. The key process behind these kinds of simultaneous localization and mapping (SLAM) systems is the iterative closest point (ICP) algorithm, an iterative algorithm that can estimate the rigid movement of the camera based on the captured 3D point clouds. While ICP is a well-studied algorithm, it is problematic when used to scan large planar regions such as wall surfaces in a room. The lack of depth variation on planar surfaces makes the global alignment an ill-conditioned problem. In this paper, we present a novel approach for registering 3D point clouds by combining both color and depth information. Instead of directly searching for point correspondences among 3D data, the proposed method first extracts features from the RGB images, and then back-projects the features to 3D space to identify more reliable correspondences. These color correspondences form the initial input to the ICP procedure, which then proceeds to refine the alignment. Experimental results show that our proposed approach can achieve better accuracy than existing SLAM systems in reconstructing indoor environments with large planar surfaces.
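The two geometric steps named in the abstract above, back-projecting matched image features to 3D and using the resulting correspondences to initialise alignment, can be sketched roughly as below. Everything specific here is an assumption made for illustration: a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, already-matched feature pairs, and a closed-form least-squares rigid fit (the Kabsch/SVD method) standing in for whatever initialisation the paper actually uses. The function names are hypothetical.

```python
import numpy as np

def backproject(uv, depth, fx, fy, cx, cy):
    """Lift pixels (u, v) with depth z to 3D camera coordinates (pinhole model)."""
    uv = np.asarray(uv, dtype=float)
    z = np.asarray(depth, dtype=float)
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~= src @ R.T + t.

    src, dst : (N, 3) arrays of corresponding 3D points (N >= 3, not collinear).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

In a pipeline of the kind the abstract describes, the `R` and `t` estimated from the color-based correspondences would serve as the starting pose that an ICP routine then refines on the full point clouds.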