1,432 research outputs found

    Learning View-Model Joint Relevance for 3D Object Retrieval

    3D object retrieval has attracted extensive research effort and has become an important task in recent years. Measuring the relevance between 3D objects nevertheless remains a difficult problem. Most existing methods employ either a model-based or a view-based approach alone, which may lead to incomplete information for 3D object representation. In this paper, we propose to jointly learn the view-model relevance among 3D objects for retrieval, in which the 3D objects are formulated in different graph structures. With the view information, the multiple views of 3D objects are employed to formulate the 3D object relationship in an object hypergraph structure. With the model data, model-based features are extracted to construct an object graph that describes the relationship among the 3D objects. Learning on the two graphs is conducted to estimate the relevance among the 3D objects, and the view/model graph weights can also be optimized in the learning process. This is the first work to jointly explore the view-based and model-based relevance among 3D objects in a graph-based framework. The proposed method has been evaluated on three data sets. The experimental results and the comparison with state-of-the-art methods demonstrate the effectiveness of the proposed 3D object retrieval method in terms of retrieval accuracy.

    Mapping, Localization and Path Planning for Image-based Navigation using Visual Features and Map

    Building on progress in feature representations for image retrieval, image-based localization has seen a surge of research interest. Image-based localization has the advantage of being inexpensive and efficient, often avoiding the use of 3D metric maps altogether. That said, the need to maintain a large number of reference images as effective support for localization in a scene nonetheless calls for them to be organized in a map structure of some kind. The problem of localization often arises as part of a navigation process. We are, therefore, interested in summarizing the reference images as a set of landmarks that meet the requirements for image-based navigation. A contribution of this paper is to formulate such a set of requirements for the two sub-tasks involved: map construction and self-localization. These requirements are then exploited for compact map representation and accurate self-localization, using the framework of a network flow problem. During this process, we formulate the map construction and self-localization problems as convex quadratic and second-order cone programs, respectively. We evaluate our methods on publicly available indoor and outdoor datasets, where they outperform existing methods significantly. Comment: CVPR 2019, for implementation see https://github.com/janinethom

    Leveraging Deep Visual Descriptors for Hierarchical Efficient Localization

    Many robotics applications require precise pose estimates despite operating in large and changing environments. This can be addressed by visual localization, using a pre-computed 3D model of the surroundings. The pose estimation then amounts to finding correspondences between 2D keypoints in a query image and 3D points in the model using local descriptors. However, computational power is often limited on robotic platforms, making this task challenging in large-scale environments. Binary feature descriptors significantly speed up this 2D-3D matching and have become popular in the robotics community, but they also strongly impair the robustness to perceptual aliasing and to changes in viewpoint, illumination and scene structure. In this work, we propose to leverage recent advances in deep learning to perform an efficient hierarchical localization. We first localize at the map level using learned image-wide global descriptors, and subsequently estimate a precise pose from 2D-3D matches computed in the candidate places only. This restricts the local search and thus allows us to efficiently exploit powerful non-binary descriptors usually dismissed on resource-constrained devices. Our approach results in state-of-the-art localization performance while running in real-time on a popular mobile platform, enabling new prospects for robotics research. Comment: CoRL 2018 Camera-ready (fix typos and update citations)
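
The coarse-to-fine idea in this abstract (global retrieval first, local matching only inside the retrieved candidates) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names `retrieve_candidates` and `match_local`, the use of cosine similarity for global descriptors, and the nearest-neighbour ratio test for local descriptors are stand-ins, and the final pose estimation from the 2D-3D matches is left out.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_candidates(query_global: Vector,
                        db_global: List[Vector],
                        k: int) -> List[int]:
    """Coarse step: indices of the k database images whose global
    descriptors are most similar to the query's global descriptor."""
    ranked = sorted(range(len(db_global)),
                    key=lambda i: cosine(query_global, db_global[i]),
                    reverse=True)
    return ranked[:k]


def match_local(query_desc: List[Vector],
                cand_desc: List[Vector],
                ratio: float = 0.8) -> List[Tuple[int, int]]:
    """Fine step: nearest-neighbour matching with a ratio test
    (an assumed, generic matcher), run only on descriptors that
    belong to the candidate places."""
    matches = []
    for qi, q in enumerate(query_desc):
        dists = sorted((math.dist(q, c), ci) for ci, c in enumerate(cand_desc))
        # Accept the best match only if it is clearly better than the runner-up.
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches


# Hypothetical toy data: three database images with 3-D global descriptors.
db_global = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
candidates = retrieve_candidates([0.1, 0.9, 0.0], db_global, k=2)

# Local descriptors that would belong to the retrieved candidates only.
cand_desc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = match_local([[0.9, 0.1], [0.1, 0.9]], cand_desc)
```

Because the local search only sees descriptors from the `k` retrieved places, its cost scales with the size of the candidate set rather than with the whole map, which is the point the abstract makes about affording richer non-binary descriptors.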

    Video Registration in Egocentric Vision under Day and Night Illumination Changes

    With the spread of wearable devices and head-mounted cameras, a wide range of applications requiring precise user localization are now possible. In this paper we propose to treat the problem of obtaining the user position with respect to a known environment as a video registration problem. Video registration, i.e. the task of aligning an input video sequence to a pre-built 3D model, relies on matching local keypoints extracted from the query sequence to a 3D point cloud. The overall registration performance is strictly tied to the quality of this 2D-3D matching, and can degrade under adverse environmental conditions such as the steep changes in lighting between day and night. To effectively register an egocentric video sequence under these conditions, we propose to tackle the source of the problem: the matching process. To overcome the shortcomings of standard matching techniques, we introduce a novel embedding space that allows us to obtain robust matches by jointly taking into account local descriptors, their spatial arrangement and their temporal robustness. The proposal is evaluated on unconstrained egocentric video sequences, both in terms of matching quality and of resulting registration performance, using different 3D models of historical landmarks. The results show that the proposed method can outperform state-of-the-art registration algorithms, in particular when dealing with the challenges of night and day sequences.

    On Rearrangement of Items Stored in Stacks

    There are $n \ge 2$ stacks, each filled with $d$ items, and one empty stack. Every stack has capacity $d > 0$. A robot arm, in one stack operation (step), may pop one item from the top of a non-empty stack and subsequently push it onto a stack not at capacity. In a {\em labeled} problem, all $nd$ items are distinguishable and are initially randomly scattered in the $n$ stacks. The items must be rearranged using pop-and-pushes so that in the end, the $k^{\rm th}$ stack holds items $(k-1)d + 1, \ldots, kd$, in that order, from the top to the bottom for all $1 \le k \le n$. In an {\em unlabeled} problem, the $nd$ items are of $n$ types of $d$ each. The goal is to rearrange items so that items of type $k$ are located in the $k^{\rm th}$ stack for all $1 \le k \le n$. In carrying out the rearrangement, a natural question is to find the least number of required pop-and-pushes. Our main contributions are: (1) an algorithm for restoring the order of $n^2$ items stored in an $n \times n$ table using only $2n$ column and row permutations, and its generalization, and (2) an algorithm with a guaranteed upper bound of $O(nd)$ steps for solving both versions of the stack rearrangement problem when $d \le \lceil cn \rceil$ for an arbitrary fixed positive number $c$. In terms of the required number of steps, the labeled and unlabeled versions have lower bounds $\Omega(nd + nd\frac{\log d}{\log n})$ and $\Omega(nd)$, respectively.
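
The problem statement above can be made concrete with a small model of the stacks. This is only a sketch of the problem setup under stated assumptions, not the paper's algorithm: the names `move`, `is_solved_labeled` and `is_solved_unlabeled` are hypothetical, each stack is stored as a Python list whose last element is the top, and the extra buffer stack is stored at index `n`.

```python
from typing import List

Stacks = List[List[int]]  # each inner list is one stack; last element = top


def move(stacks: Stacks, src: int, dst: int, capacity: int) -> None:
    """One pop-and-push step: pop the top of `src` and push it onto `dst`."""
    if src == dst:
        raise ValueError("source and destination must differ")
    if not stacks[src]:
        raise ValueError("source stack is empty")
    if len(stacks[dst]) >= capacity:
        raise ValueError("destination stack is at capacity")
    stacks[dst].append(stacks[src].pop())


def is_solved_labeled(stacks: Stacks, n: int, d: int) -> bool:
    """Labeled goal: stack k (1-based) holds (k-1)d+1, ..., kd from top to
    bottom, and the buffer stack at index n is empty."""
    for k in range(1, n + 1):
        top_to_bottom = stacks[k - 1][::-1]
        if top_to_bottom != list(range((k - 1) * d + 1, k * d + 1)):
            return False
    return stacks[n] == []


def is_solved_unlabeled(stacks: Stacks, n: int, d: int) -> bool:
    """Unlabeled goal: items are type ids 1..n, and stack k holds exactly
    d items of type k."""
    return (all(stacks[k - 1] == [k] * d for k in range(1, n + 1))
            and stacks[n] == [])


# Toy instance with n = 2, d = 2: two full stacks plus one empty buffer.
n, d = 2, 2
stacks = [[2, 3], [4, 1], []]
steps = [(1, 2), (0, 1), (2, 0)]  # a hand-picked 3-step solution
for src, dst in steps:
    move(stacks, src, dst, capacity=d)
solved = is_solved_labeled(stacks, n, d)
```

Counting the entries of `steps` gives the number of pop-and-pushes, which is the quantity the stated bounds refer to.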

    Learning 3D Scene Priors with 2D Supervision

    Holistic 3D scene understanding entails estimation of both layout configuration and object geometry in a 3D environment. Recent works have shown advances in 3D scene estimation from various input modalities (e.g., images, 3D scans) by leveraging 3D supervision (e.g., 3D bounding boxes or CAD models), for which collection at scale is expensive and often intractable. To address this shortcoming, we propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth. Instead, we rely on 2D supervision from multi-view RGB images. Our method represents a 3D scene as a latent vector, which we progressively decode into a sequence of objects characterized by their class categories, 3D bounding boxes, and meshes. With our trained autoregressive decoder representing the scene prior, our method facilitates many downstream applications, including scene synthesis, interpolation, and single-view reconstruction. Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction, and achieves state-of-the-art results in scene synthesis against baselines which require 3D supervision. Comment: Video: https://youtu.be/YT7MEdygRoY Project: https://yinyunie.github.io/sceneprior-page

    3D Shape Knowledge Graph for Cross-domain and Cross-modal 3D Shape Retrieval

    With the development of 3D modeling and fabrication, 3D shape retrieval has become a hot topic. In recent years, several strategies have been put forth to address this retrieval issue. However, it is difficult for them to handle cross-modal 3D shape retrieval because of the natural differences between modalities. In this paper, we propose an innovative concept, namely geometric words, which are regarded as the basic elements that represent any 3D or 2D entity by combination, and with whose help we can simultaneously handle cross-domain and cross-modal retrieval problems. First, to construct the knowledge graph, we utilize the geometric words as the nodes, and then use the category of the 3D shape as well as the attribute of the geometry to bridge the nodes. Second, based on the knowledge graph, we provide a unique way of learning each entity's embedding. Finally, we propose an effective similarity measure to handle cross-domain and cross-modal 3D shape retrieval. Specifically, every 3D or 2D entity can locate its geometric words in the 3D knowledge graph, which serve as a link between cross-domain and cross-modal data. Thus, our approach can achieve cross-domain and cross-modal 3D shape retrieval at the same time. We evaluated our proposed method on the ModelNet40 and ShapeNetCore55 datasets for both the 3D shape retrieval task and the cross-domain 3D shape retrieval task. The classic cross-modal dataset (MI3DOR) is utilized to evaluate cross-modal 3D shape retrieval. Experimental results and comparisons with state-of-the-art methods illustrate the superiority of our approach.