
    Recovering 6D Object Pose: A Review and Multi-modal Analysis

    A large number of studies analyse object detection and pose estimation at the visual level in 2D, discussing the effects of challenges such as occlusion, clutter and texture on the performance of methods that work in the RGB modality. Incorporating depth data, this paper presents a thorough multi-modal analysis. It discusses the above-mentioned challenges for full 6D object pose estimation in RGB-D images, comparing the performance of several 6D detectors in order to answer the following questions: What is the current position of the computer vision community in maintaining "automation" in robotic manipulation? What next steps should the community take to improve "autonomy" in robotics while handling objects? Our findings include: (i) reasonably accurate results are obtained on textured objects at varying viewpoints with cluttered backgrounds; (ii) heavy occlusion and clutter severely affect the detectors, and similar-looking distractors are the biggest challenge in recovering instances' 6D poses; (iii) template-based methods and random-forest-based learning algorithms underlie object detection and 6D pose estimation, while the recent paradigm is to learn deep discriminative feature representations and to adopt CNNs taking RGB images as input; (iv) given the availability of large-scale 6D annotated depth datasets, feature representations can be learnt on these datasets and the learnt representations can then be customized for the 6D problem.
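
    The "recent paradigm" noted in finding (iii), CNNs that take RGB images as input and learn discriminative features for pose, can be sketched minimally as below. The architecture and the quaternion-plus-translation output are generic illustrative choices, not any particular detector covered by the review.

```python
# Minimal sketch of a CNN that regresses a 6D pose (rotation as a unit
# quaternion, plus translation) from an RGB crop of a detected object.
# Illustrative only -- not a specific detector from the survey.
import torch
import torch.nn as nn

class PoseCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rot_head = nn.Linear(128, 4)    # quaternion (w, x, y, z)
        self.trans_head = nn.Linear(128, 3)  # translation (x, y, z)

    def forward(self, x):
        f = self.backbone(x)
        q = self.rot_head(f)
        q = q / q.norm(dim=1, keepdim=True)  # normalize to a unit quaternion
        return q, self.trans_head(f)

pose_net = PoseCNN()
q, t = pose_net(torch.randn(1, 3, 128, 128))  # one 128x128 RGB crop
```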

    GASP : Geometric Association with Surface Patches

    A fundamental challenge for sensory processing tasks in perception and robotics is the problem of obtaining data associations across views. We present a robust solution for ascertaining potentially dense surface patch (superpixel) associations that requires just range information. Our approach decomposes a view into regularized surface patches and represents them as sequences that express geometry invariantly over their superpixel neighborhoods, as uniquely consistent partial orderings. We match these representations through an optimal sequence comparison metric based on the Damerau-Levenshtein distance, enabling robust association with quadratic complexity (in contrast to the hitherto employed joint matching formulations, which are NP-complete). The approach performs under wide baselines, heavy rotations, partial overlaps, significant occlusions and sensor noise. The technique does not require any priors -- motion or otherwise -- and does not make restrictive assumptions about scene structure or sensor movement. It does not require appearance, and is hence more widely applicable than appearance-reliant methods and invulnerable to related ambiguities such as textureless or aliased content. We present promising qualitative and quantitative results under diverse settings, along with comparisons to popular approaches based on range as well as RGB-D data.
    Comment: International Conference on 3D Vision, 201
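
    The sequence comparison metric named above, the Damerau-Levenshtein distance, admits a compact dynamic-programming implementation; the sketch below shows its common restricted (optimal string alignment) variant. The patch-label encoding in the example is hypothetical, and GASP's actual representation and metric are richer than this.

```python
# Restricted Damerau-Levenshtein (optimal string alignment) distance:
# minimum number of insertions, deletions, substitutions and adjacent
# transpositions turning sequence a into sequence b. O(len(a)*len(b)).
def damerau_levenshtein(a, b):
    m, n = len(a), len(b)
    # d[i][j] = distance between prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

# Patch neighborhoods encoded as label sequences (hypothetical encoding):
print(damerau_levenshtein("ACBD", "ABCD"))  # -> 1 (one transposition)
```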

    Robust Wide-Baseline Stereo Matching for Sparsely Textured Scenes

    The task of wide baseline stereo matching algorithms is to identify corresponding elements in pairs of overlapping images taken from significantly different viewpoints. Such algorithms are a key ingredient of many computer vision applications, including object recognition, automatic camera orientation, 3D reconstruction and image registration. Although today's methods for wide baseline stereo matching produce reliable results for typical application scenarios, they assume properties of the image data that are not always given, for example a significant amount of distinctive surface texture. For such problems, highly advanced algorithms have been proposed, which are often very problem-specific, difficult to implement and hard to transfer to new matching problems. The motivation for our work comes from the belief that we can find a generic formulation for robust wide baseline image matching that is able to solve difficult matching problems and at the same time is applicable to a variety of applications. It should be easy to implement and have good semantic interpretability. Our key contribution is therefore the development of a generic statistical model for wide baseline stereo matching, which seamlessly integrates different types of image features, similarity measures and spatial feature relationships as information cues. It unifies the ideas of existing approaches into a Bayesian formulation, which has a clear statistical interpretation as the MAP estimate of a binary classification problem. The model ultimately takes the form of a global minimization problem that can be solved with standard optimization techniques. The particular types of features, measures and spatial relationships, however, are not prescribed. A major advantage of our model over existing approaches is its ability to compensate for weaknesses in one information cue by implicitly exploiting the strengths of others. In our experiments we concentrate on images of sparsely textured scenes as a particularly difficult matching problem. Here the number of stable image features is typically rather small, and the distinctiveness of feature descriptions often low. We use the proposed framework to implement a wide baseline stereo matching algorithm that can deal with poor texture better than established methods. To demonstrate its practical relevance, we also apply this algorithm to a system for automatic image orientation, where the task is to reconstruct the relative 3D positions and orientations of the cameras corresponding to a set of overlapping images. We show that our implementation leads to more successful results in the case of sparsely textured scenes, while still retaining state-of-the-art performance on standard datasets.
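
    As described, the model is the MAP estimate of a binary classification over candidate correspondences, solved as a global minimization. A plausible generic form of such an objective is sketched below; the notation is assumed for illustration and is not the thesis's own.

```latex
% Illustrative MAP matching objective (notation assumed, not the thesis's):
% x_i in {0,1} labels candidate correspondence i as correct or incorrect.
\begin{align}
  \hat{\mathbf{x}} &= \operatorname*{arg\,max}_{\mathbf{x}}
      P(\mathbf{x} \mid \mathcal{D})
    = \operatorname*{arg\,min}_{\mathbf{x}} E(\mathbf{x}),
      \qquad x_i \in \{0, 1\}, \\
  E(\mathbf{x}) &= \sum_{i} \varphi_i(x_i)
    + \sum_{(i,j) \in \mathcal{N}} \psi_{ij}(x_i, x_j).
\end{align}
```

    Here the unary terms \varphi_i would aggregate feature similarity cues for correspondence i, and the pairwise terms \psi_{ij} the spatial feature relationships; an energy of this form can be minimized with standard techniques such as graph cuts when the pairwise terms are submodular.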

    Local, Semi-Local and Global Models for Texture, Object and Scene Recognition

    This dissertation addresses the problems of recognizing textures, objects, and scenes in photographs. We present approaches to these recognition tasks that combine salient local image features with spatial relations and effective discriminative learning techniques. First, we introduce a bag-of-features image model for recognizing textured surfaces under a wide range of transformations, including viewpoint changes and non-rigid deformations. We present the results of a large-scale comparative evaluation indicating that bags of features can be effective not only for texture, but also for object categorization, even in the presence of substantial clutter and intra-class variation. We also show how to augment the purely local image representation with statistical co-occurrence relations between pairs of nearby features, and develop a learning and classification framework for the task of classifying individual features in a multi-texture image. Next, we present a more structured alternative to bags of features for object recognition, namely an image representation based on semi-local parts, or groups of features characterized by stable appearance and geometric layout. Semi-local parts are automatically learned from small sets of unsegmented, cluttered images. Finally, we present a global method for recognizing scene categories that works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting spatial pyramid representation demonstrates significantly improved performance on challenging scene categorization tasks.
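
    The spatial pyramid construction lends itself to a short sketch: histograms of quantized local features are pooled over successively finer grids and concatenated with per-level weights. The sketch below follows the standard formulation; the function name, grid depth and weight scheme are illustrative assumptions.

```python
import numpy as np

# Spatial pyramid sketch: pool visual-word histograms over 1x1, 2x2,
# 4x4, ... grids and concatenate. `xy` holds feature positions in [0,1)^2,
# `words` their visual-word indices; names and weights are illustrative.
def spatial_pyramid(xy, words, vocab_size, levels=2):
    chunks = []
    for l in range(levels + 1):
        cells = 2 ** l                        # cells per image side
        # common level weights: 1/2^L for level 0, else 1/2^(L - l + 1)
        w = 1.0 / 2 ** levels if l == 0 else 1.0 / 2 ** (levels - l + 1)
        cell_idx = np.minimum((xy * cells).astype(int), cells - 1)
        flat = cell_idx[:, 1] * cells + cell_idx[:, 0]  # cell id per feature
        for c in range(cells * cells):
            hist = np.bincount(words[flat == c], minlength=vocab_size)
            chunks.append(w * hist)
    return np.concatenate(chunks)

xy = np.random.rand(500, 2)               # 500 features, normalized coords
words = np.random.randint(0, 200, 500)    # quantized to a 200-word vocabulary
desc = spatial_pyramid(xy, words, 200)    # length 200 * (1 + 4 + 16) = 4200
```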

    Leveraging 3D City Models for Rotation Invariant Place-of-Interest Recognition

    Given a cell phone image of a building, we address the problem of place-of-interest recognition in urban scenarios. We go beyond earlier approaches by exploiting the nowadays often available 3D building information (e.g. from extruded floor plans) and massive street-level image data for database creation. Exploiting vanishing points in query images, and thus fully removing 3D rotation from the recognition problem, then allows us to simplify the feature invariance to a purely homothetic problem, which we show enables more discriminative power in feature descriptors than classical SIFT. We rerank visual-word-based document queries using a fast stratified homothetic verification that in most cases boosts the correct document to the top positions if it was in the short list. Since we exploit 3D building information, the approach finally outputs the camera pose in real-world coordinates, ready for augmenting the cell phone image with virtual 3D information. The whole system is demonstrated to outperform traditional approaches in city-scale experiments for different sources of street-level image data and a challenging set of cell phone images.
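
    A homothetic transform consists of just an isotropic scale and a translation, which is what makes the verification fast: a minimal sample of two point matches determines it. The RANSAC-style sketch below conveys the idea only; the paper's stratified scheme is more elaborate, and all names and thresholds here are assumptions.

```python
import numpy as np

# RANSAC-style homothetic (scale + translation) verification of putative
# keypoint matches: q ~ s * p + t. Minimal sample: two matches.
def verify_homothetic(p, q, iters=200, thresh=5.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(p), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(p), size=2, replace=False)
        d = np.linalg.norm(p[i] - p[j])
        if d < 1e-9:
            continue                          # degenerate sample
        s = np.linalg.norm(q[i] - q[j]) / d   # scale from the sampled pair
        t = q[i] - s * p[i]                   # translation from one match
        residuals = np.linalg.norm(s * p + t - q, axis=1)
        inliers = residuals < thresh          # pixels
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# p, q: Nx2 arrays of matched keypoint positions (query and database image)
```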

    A Robust RGBD SLAM System for 3D Environment with Planar Surfaces

    With the increasing popularity of RGB-depth (RGB-D) sensors such as the Microsoft Kinect, there has been much research on capturing and reconstructing 3D environments using a movable RGB-D sensor. The key process behind these kinds of simultaneous localization and mapping (SLAM) systems is the iterative closest point (ICP) algorithm, an iterative method that estimates the rigid motion of the camera from the captured 3D point clouds. While ICP is a well-studied algorithm, it is problematic when used to scan large planar regions such as wall surfaces in a room: the lack of depth variation on planar surfaces makes the global alignment an ill-conditioned problem. In this paper, we present a novel approach for registering 3D point clouds that combines both color and depth information. Instead of directly searching for point correspondences among the 3D data, the proposed method first extracts features from the RGB images and then back-projects them into 3D space to identify more reliable correspondences. These color correspondences form the initial input to the ICP procedure, which then proceeds to refine the alignment. Experimental results show that our proposed approach achieves better accuracy than existing SLAM systems in reconstructing indoor environments with large planar surfaces.
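
    The core pipeline described here, matching RGB features, back-projecting them to 3D through the depth map, and estimating a rigid transform to initialize ICP, can be sketched as follows. The ORB features, pinhole intrinsics and depth scale are stand-in assumptions, not the paper's exact choices.

```python
import cv2
import numpy as np

# Sketch: match RGB features, back-project them to 3D via the depth map,
# and estimate a rigid transform (Kabsch) to initialize ICP refinement.
# Intrinsics (fx, fy, cx, cy) and the depth scale are assumed values.
FX, FY, CX, CY, DEPTH_SCALE = 525.0, 525.0, 319.5, 239.5, 0.001

def backproject(u, v, z):
    return np.array([(u - CX) * z / FX, (v - CY) * z / FY, z])

def rigid_from_rgbd(rgb1, depth1, rgb2, depth2):
    orb = cv2.ORB_create()
    k1, d1 = orb.detectAndCompute(cv2.cvtColor(rgb1, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = orb.detectAndCompute(cv2.cvtColor(rgb2, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    P, Q = [], []
    for m in matches:
        (u1, v1), (u2, v2) = k1[m.queryIdx].pt, k2[m.trainIdx].pt
        z1 = depth1[int(v1), int(u1)] * DEPTH_SCALE
        z2 = depth2[int(v2), int(u2)] * DEPTH_SCALE
        if z1 > 0 and z2 > 0:                 # keep matches with valid depth
            P.append(backproject(u1, v1, z1))
            Q.append(backproject(u2, v2, z2))
    P, Q = np.array(P), np.array(Q)
    # Kabsch: least-squares rotation/translation mapping P onto Q
    cp, cq = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflection
    R = Vt.T @ S @ U.T
    t = cq - R @ cp
    return R, t   # feed as the initial guess to a standard ICP refinement
```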