
    Rich probabilistic models for semantic labeling

    The aim of this monograph is to explore the methods and applications of semantic labeling. Our contributions to this rapidly developing topic concern specific aspects of modeling and inference in probabilistic models and their applications in the interdisciplinary fields of computer vision, medical image processing, and remote sensing.

    Walk2Map: Extracting Floor Plans from Indoor Walk Trajectories

    Recent years have seen a proliferation of new digital products for the efficient management of indoor spaces, with important applications like emergency management, virtual property showcasing and interior design. These products rely on accurate 3D models of the environments considered, including information on both architectural and non-permanent elements. These models must be created from measured data such as RGB-D images or 3D point clouds, whose capture and consolidation involve lengthy data workflows. This strongly limits the rate at which 3D models can be produced, preventing the adoption of many digital services for indoor space management. We provide an alternative to such data-intensive procedures by presenting Walk2Map, a data-driven approach to generate floor plans solely from trajectories of a person walking inside the rooms. Thanks to recent advances in data-driven inertial odometry, such minimalistic input data can be acquired from the IMU readings of consumer-level smartphones, which allows for an effortless and scalable mapping of real-world indoor spaces. Our work is based on learning the latent relation between an indoor walk trajectory and the information represented in a floor plan: interior space footprint, portals, and furniture. We distinguish between recovering area-related (interior footprint, furniture) and wall-related (doors) information and use two different neural architectures for the two tasks: an image-based Encoder-Decoder and a Graph Convolutional Network, respectively. We train our networks using scanned 3D indoor models and apply them in a cascaded fashion on an indoor walk trajectory at inference time. We perform a qualitative and quantitative evaluation using both simulated and measured, real-world trajectories, and compare against a baseline method for image-to-image translation. The experiments confirm the feasibility of our approach. Comment: To be published in Computer Graphics Forum (Proc. Eurographics 2021)

    Advances in Data-Driven Analysis and Synthesis of 3D Indoor Scenes

    This report surveys advances in deep learning-based modeling techniques that address four different 3D indoor scene analysis tasks, as well as synthesis of 3D indoor scenes. We describe different kinds of representations for indoor scenes, various indoor scene datasets available for research in the aforementioned areas, and discuss notable works employing machine learning models for such scene modeling tasks based on these representations. Specifically, we focus on the analysis and synthesis of 3D indoor scenes. With respect to analysis, we focus on four basic scene understanding tasks: 3D object detection, 3D scene segmentation, 3D scene reconstruction and 3D scene similarity. For synthesis, we mainly discuss neural scene synthesis works, while also highlighting model-driven methods that allow for human-centric, progressive scene synthesis. We identify the challenges involved in modeling scenes for these tasks and the kind of machinery that needs to be developed to adapt to the data representation, and the task setting in general. For each of these tasks, we provide a comprehensive summary of the state-of-the-art works across different axes such as the choice of data representation, backbone, evaluation metric, input, output, etc., providing an organized review of the literature. Towards the end, we discuss some interesting research directions that have the potential to make a direct impact on the way users interact and engage with these virtual scene models, making them an integral part of the metaverse. Comment: Published in Computer Graphics Forum, Aug 202

    Pop-up SLAM: Semantic Monocular Plane SLAM for Low-texture Environments

    Existing simultaneous localization and mapping (SLAM) algorithms are not robust in challenging low-texture environments because there are only a few salient features. The resulting sparse or semi-dense map also conveys little information for motion planning. Though some works utilize plane or scene layout for dense map regularization, they require decent state estimation from other sources. In this paper, we propose real-time monocular plane SLAM to demonstrate that scene understanding could improve both state estimation and dense mapping, especially in low-texture environments. The plane measurements come from a pop-up 3D plane model applied to each single image. We also combine planes with point-based SLAM to improve robustness. On a public TUM dataset, our algorithm generates a dense semantic 3D model with pixel depth error of 6.2 cm while existing SLAM algorithms fail. On a 60 m long dataset with loops, our method creates a much better 3D model with state estimation error of 0.67%. Comment: International Conference on Intelligent Robots and Systems (IROS) 201

    iMAPPER: Interaction-guided Scene Mapping from Monocular Videos

    Next generation smart and augmented reality systems demand a computational understanding of monocular footage that captures humans in physical spaces to reveal plausible object arrangements and human-object interactions. Despite recent advances, both in scene layout and human motion analysis, the above setting remains challenging to analyze due to regular occlusions that occur between objects and human motions. We observe that the interaction between object arrangements and human actions is often strongly correlated, and hence can be used to help recover from these occlusions. We present iMapper, a data-driven method to identify such human-object interactions and utilize them to infer layouts of occluded objects. Starting from a monocular video with detected 2D human joint positions that are potentially noisy and occluded, we first introduce the notion of interaction-saliency as space-time snapshots where informative human-object interactions happen. Then, we propose a global optimization to retrieve and fit interactions from a database to the detected salient interactions in order to best explain the input video. We extensively evaluate the approach, both quantitatively against manually annotated ground truth and through a user study, and demonstrate that iMapper produces plausible scene layouts for scenes with medium to heavy occlusion. Code and data are available on the project page

    Dynamic Scene Reconstruction and Understanding

    Traditional approaches to 3D reconstruction have achieved remarkable progress in static scene acquisition. The acquired data serves as priors or benchmarks for many vision and graphics tasks, such as object detection and robotic navigation. Thus, obtaining interpretable and editable representations from a raw monocular RGB-D video sequence is an outstanding goal in scene understanding. However, acquiring an interpretable representation becomes significantly more challenging when a scene contains dynamic activities; for example, a moving camera, rigid object movement, and non-rigid motions. These dynamic scene elements introduce a scene factorization problem, i.e., dividing a scene into elements and jointly estimating elements’ motion and geometry. Moreover, the monocular setting brings in the problems of tracking and fusing partially occluded objects as they are scanned from one viewpoint at a time. This thesis explores several ideas for acquiring an interpretable model in dynamic environments. Firstly, we utilize synthetic assets such as floor plans and object meshes to generate dynamic data for training and evaluation. Then, we explore the idea of learning geometry priors with an instance segmentation module, which predicts the location and grouping of indoor objects. We use the learned geometry priors to infer the occluded object geometry for tracking and reconstruction. While instance segmentation modules usually have a generalization issue, i.e., they struggle to handle unknown objects, we observe that the empty space information in the background geometry is more reliable for detecting moving objects. Thus, we propose a segmentation-by-reconstruction strategy for acquiring rigidly-moving objects and backgrounds. Finally, we present a novel neural representation to learn a factorized scene representation, reconstructing every dynamic element. The proposed model supports both rigid and non-rigid motions without pre-trained templates. We demonstrate that our systems and representation improve the reconstruction quality on synthetic test sets and real-world scans.

    3D Reconstruction of Indoor Corridor Models Using Single Imagery and Video Sequences

    In recent years, 3D indoor modeling has gained more attention due to its role in the decision-making process of maintaining the status and managing the security of building indoor spaces. In this thesis, the problem of continuous indoor corridor space modeling has been tackled through two approaches. The first approach develops a modeling method based on middle-level perceptual organization. The second approach develops a visual Simultaneous Localisation and Mapping (SLAM) system with model-based loop closure. In the first approach, the image space was searched for a corridor layout that can be converted into a geometrically accurate 3D model. The Manhattan rule assumption was adopted, and indoor corridor layout hypotheses were generated through a random rule-based intersection of image physical line segments and virtual rays of orthogonal vanishing points. Volumetric reasoning, correspondences to physical edges, orientation map and geometric context of an image are all considered for scoring layout hypotheses. This approach provides physically plausible solutions when facing objects or occlusions in a corridor scene. In the second approach, Layout SLAM is introduced. Layout SLAM performs camera localization while mapping layout corners and normal point features in 3D space. Here, a new feature matching cost function was proposed considering both local and global context information. In addition, a rotation compensation variable makes Layout SLAM robust against accumulated camera orientation errors. Moreover, layout model matching of keyframes ensures accurate loop closures that prevent misassociation of newly visited landmarks with previously visited scene parts. The comparison of generated single image-based 3D models to ground truth models showed that average ratio differences in widths, heights and lengths were 1.8%, 3.7% and 19.2% respectively. Moreover, Layout SLAM achieved a maximum absolute trajectory error of 2.4 m in position and 8.2 degrees in orientation over a path of approximately 318 m on the RAWSEEDS data set. Loop closing performed robustly for Layout SLAM and provided 3D indoor corridor layouts with displacement errors of less than 1.05 m in length and less than 20 cm in width and height over a path of approximately 315 m on the York University data set. The proposed methods can successfully generate 3D indoor corridor models compared to their major counterparts.
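    The maximum absolute trajectory error quoted above can be illustrated with a minimal sketch. It assumes the estimated and ground-truth positions are already time-associated and expressed in the same coordinate frame; the abstract does not state the thesis's exact evaluation protocol, so the function names and the RMSE variant below are illustrative assumptions, not the author's code.

```python
import math


def position_errors(estimated, ground_truth):
    """Per-pose Euclidean distance between matched positions (same units as the input)."""
    if len(estimated) != len(ground_truth):
        raise ValueError("trajectories must contain the same number of poses")
    if not estimated:
        raise ValueError("trajectories must not be empty")
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]


def max_absolute_trajectory_error(estimated, ground_truth):
    """Largest position error along the path (assumed meaning of 'maximum ATE')."""
    return max(position_errors(estimated, ground_truth))


def rmse_absolute_trajectory_error(estimated, ground_truth):
    """Root-mean-square position error, a commonly reported alternative summary."""
    errors = position_errors(estimated, ground_truth)
    return math.sqrt(sum(err * err for err in errors) / len(errors))


if __name__ == "__main__":
    # Hypothetical 2D positions in metres, for illustration only.
    est = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    gt = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4)]
    print(max_absolute_trajectory_error(est, gt))
    print(rmse_absolute_trajectory_error(est, gt))
```

    The maximum highlights the worst drift along the path, while the RMSE summarizes typical error; which one a given evaluation reports should be checked against its source.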

    Pose2Room: Understanding 3D Scenes from Human Activities

    With wearable IMU sensors, one can estimate human poses from wearable devices without requiring visual input \cite{von2017sparse}. In this work, we pose the question: Can we reason about object structure in real-world environments solely from human trajectory information? Crucially, we observe that human motion and interactions tend to give strong information about the objects in a scene; for instance, a person sitting indicates the likely presence of a chair or sofa. To this end, we propose P2R-Net to learn a probabilistic 3D model of the objects in a scene characterized by their class categories and oriented 3D bounding boxes, based on an input observed human trajectory in the environment. P2R-Net models the probability distribution of object class as well as a deep Gaussian mixture model for object boxes, enabling sampling of multiple, diverse, likely modes of object configurations from an observed human trajectory. In our experiments we show that P2R-Net can effectively learn multi-modal distributions of likely objects for human motions, and produce a variety of plausible object structures of the environment, even without any visual information. The results demonstrate that P2R-Net consistently outperforms the baselines on the PROX dataset and the VirtualHome platform. Comment: Accepted by ECCV'2022; Project page: https://yinyunie.github.io/pose2room-page/ Video: https://www.youtube.com/watch?v=MFfKTcvbM5
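    Sampling object boxes from a Gaussian mixture, as the abstract describes, can be sketched generically as follows. This is not P2R-Net's implementation: in the paper the mixture parameters are predicted by a network, whereas here the weights, means and standard deviations are hand-written hypothetical values, the components are assumed to be diagonal Gaussians, and `sample_box` is an illustrative name.

```python
import random


def sample_box(weights, means, stds, rng):
    """Draw one box-parameter vector from a diagonal Gaussian mixture.

    weights: mixing weight per component (non-negative, not all zero)
    means, stds: one list of per-dimension values per component
    rng: a random.Random instance, passed in so sampling is reproducible
    Returns (component_index, sampled_parameters).
    """
    if not (len(weights) == len(means) == len(stds)):
        raise ValueError("weights, means and stds must have one entry per component")
    # Step 1: choose a mixture component (a 'mode') according to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Step 2: sample each box dimension independently from that component.
    return k, [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical two-mode mixture over (center_x, center_y, width) of one object.
    weights = [0.7, 0.3]
    means = [[1.0, 2.0, 0.5], [3.0, 0.5, 1.8]]
    stds = [[0.1, 0.1, 0.05], [0.2, 0.2, 0.1]]
    for _ in range(3):
        print(sample_box(weights, means, stds, rng))
```

    Repeated draws land in different components, which is how a mixture can yield several distinct yet plausible object configurations for the same observed trajectory.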