    3D Shape Segmentation with Projective Convolutional Networks

    This paper introduces a deep architecture for segmenting 3D objects into their labeled semantic parts. Our architecture combines image-based Fully Convolutional Networks (FCNs) and surface-based Conditional Random Fields (CRFs) to yield coherent segmentations of 3D shapes. The image-based FCNs are used for efficient view-based reasoning about 3D object parts. Through a special projection layer, FCN outputs are effectively aggregated across multiple views and scales, then projected onto the 3D object surfaces. Finally, a surface-based CRF combines the projected outputs with geometric consistency cues to yield coherent segmentations. The whole architecture (multi-view FCNs and CRF) is trained end-to-end. Our approach significantly outperforms existing state-of-the-art methods on the currently largest segmentation benchmark (ShapeNet). Finally, we demonstrate promising segmentation results on noisy 3D shapes acquired from consumer-grade depth cameras. Comment: This is an updated version of our CVPR 2017 paper. We incorporated new experiments that demonstrate ShapePFCN performance in the case of consistent *upright* orientation and an additional input channel in our rendered images encoding height from the ground plane (upright-axis coordinate values). Performance is improved in this setting.
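
    The core step described above, projecting per-view FCN outputs onto the object surface and aggregating them before the CRF, can be illustrated with a minimal sketch. This is not the authors' code: the array shapes, the per-pixel face-index maps, and the max-pooling aggregation rule are assumptions made for illustration (Python/NumPy).

    import numpy as np

    def aggregate_view_scores(view_scores, face_index_maps, num_faces):
        """view_scores: list of (H, W, K) per-pixel part probabilities, one per rendered view.
        face_index_maps: list of (H, W) integer arrays mapping each pixel to the visible mesh
        face id (-1 where no face is visible). Returns (num_faces, K) aggregated face scores."""
        num_parts = view_scores[0].shape[-1]
        face_scores = np.zeros((num_faces, num_parts))
        for scores, face_ids in zip(view_scores, face_index_maps):
            visible = face_ids >= 0
            ids = face_ids[visible]               # face id for each visible pixel
            pix = scores[visible]                 # (N, K) scores for those pixels
            np.maximum.at(face_scores, ids, pix)  # max-pool per face across pixels and views
        return face_scores

    def unary_labels(face_scores):
        # Per-face argmax; the paper additionally smooths these unaries with a surface-based CRF.
        return face_scores.argmax(axis=1)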

    Cross-dimensional Analysis for Improved Scene Understanding

    Visual data have taken up an increasingly large role in our society. Most people have instant access to a high-quality camera in their pockets, and we are taking more pictures than ever before. Meanwhile, through the advent of better software and hardware, the prevalence of 3D data is also rapidly expanding, and demand for data and analysis methods is burgeoning in a wide range of industries. The amount of information about the world implicitly contained in this stream of data is staggering. However, as these images and models are created in uncontrolled circumstances, the extraction of any structured information from the unstructured pixels and vertices is highly non-trivial. To aid this process, we note that the 2D and 3D data modalities are similar in content but intrinsically different in form. Exploiting their complementary nature, we can investigate certain problems in a cross-dimensional fashion: for example, where 2D lacks expressiveness, 3D can supplement it; where 3D lacks quality, 2D can provide it. In this thesis, we explore three analysis tasks with this insight as our point of departure. First, we show that by considering the tasks of 2D and 3D retrieval jointly we can improve the performance of 3D retrieval while simultaneously enabling interesting new ways of exploring 2D retrieval results. Second, we discuss a compact representation of indoor scenes called a "scene map", which represents the objects in a scene using a top-down map of object locations. We propose a method for automatically extracting such scene maps from single 2D images using a database of 3D models for training. Finally, we seek to convert single 2D images to full 3D scenes using a database of 3D models as input. Occlusion is handled by modelling object context explicitly, allowing us to identify and pose objects that would otherwise be too occluded to make inferences about. For all three tasks, we show the utility of our cross-dimensional insight by evaluating each method extensively and showing favourable performance over baseline methods.
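
    As a concrete illustration of the "scene map" representation mentioned above, the following is a minimal sketch of a top-down record of object locations. The field names, units, and the dataclass layout are assumptions for illustration, not the thesis format (Python).

    from dataclasses import dataclass, field

    @dataclass
    class SceneObject:
        category: str      # e.g. "chair", "table"
        x: float           # position on the floor plane, metres (assumed units)
        z: float
        yaw: float = 0.0   # orientation about the vertical axis, radians

    @dataclass
    class SceneMap:
        width: float       # footprint extent of the room, metres
        depth: float
        objects: list = field(default_factory=list)

        def add(self, obj: SceneObject):
            self.objects.append(obj)

        def by_category(self, category: str):
            return [o for o in self.objects if o.category == category]

    # Example: a small scene with two objects.
    scene = SceneMap(width=4.0, depth=5.0)
    scene.add(SceneObject("table", x=1.2, z=2.5))
    scene.add(SceneObject("chair", x=1.8, z=2.2, yaw=1.57))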

    Deep Learning-Based Human Pose Estimation: A Survey

    Human pose estimation aims to locate the human body parts and build a human body representation (e.g., a body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although the recently developed deep learning-based solutions have achieved high performance in human pose estimation, challenges remain due to insufficient training data, depth ambiguities, and occlusion. The goal of this survey paper is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers since 2014 are covered in this survey. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the remaining challenges, applications, and future research directions are discussed. We also provide a regularly updated project page: \url{https://github.com/zczcwh/DL-HPE}.
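
    Many of the 2D methods covered by such surveys predict one confidence heatmap per body joint and decode keypoints from the heatmap maxima. The following is a minimal sketch of that decoding step under assumed shapes and an assumed confidence threshold; it is illustrative and not tied to any specific surveyed method (Python/NumPy).

    import numpy as np

    def decode_keypoints(heatmaps, threshold=0.1):
        """heatmaps: (J, H, W) array, one confidence map per body joint.
        Returns a (J, 3) array of (x, y, confidence); joints below the threshold are zeroed."""
        num_joints, height, width = heatmaps.shape
        keypoints = np.zeros((num_joints, 3))
        for j in range(num_joints):
            y, x = divmod(heatmaps[j].argmax(), width)   # location of the heatmap maximum
            conf = heatmaps[j, y, x]
            if conf >= threshold:
                keypoints[j] = (x, y, conf)
        return keypoints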

    3D Scene Geometry Estimation from 360° Imagery: A Survey

    This paper provides a comprehensive survey on pioneer and state-of-the-art 3D scene geometry estimation methodologies based on single, two, or multiple images captured with omnidirectional optics. We first revisit the basic concepts of the spherical camera model and review the most common acquisition technologies and representation formats suitable for omnidirectional (also called 360°, spherical, or panoramic) images and videos. We then survey monocular layout and depth inference approaches, highlighting the recent advances in learning-based solutions suited for spherical data. Classical stereo matching is then revisited in the spherical domain, where methodologies for detecting and describing sparse and dense features become crucial. The stereo matching concepts are then extrapolated to multiple-view camera setups, categorizing them among light fields, multi-view stereo, and structure from motion (or visual simultaneous localization and mapping). We also compile and discuss commonly adopted datasets and figures of merit indicated for each purpose and list recent results for completeness. We conclude this paper by pointing out current and future trends. Comment: Published in ACM Computing Surveys.
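
    The spherical camera model revisited above admits a compact worked example: mapping a pixel of an equirectangular 360° image to a unit ray direction on the sphere. Axis conventions and the image-origin convention differ between papers; the ones used here are assumptions (Python/NumPy).

    import numpy as np

    def equirect_pixel_to_ray(u, v, width, height):
        """u, v: pixel coordinates in an equirectangular image of size (width, height).
        Returns a unit 3D direction (x, y, z) on the viewing sphere."""
        lon = (u / width) * 2.0 * np.pi - np.pi      # longitude in [-pi, pi)
        lat = np.pi / 2.0 - (v / height) * np.pi     # latitude in [-pi/2, pi/2]
        x = np.cos(lat) * np.sin(lon)
        y = np.sin(lat)
        z = np.cos(lat) * np.cos(lon)
        return np.array([x, y, z])

    # Example: the image centre maps to the forward direction (0, 0, 1) under these conventions.
    print(equirect_pixel_to_ray(960, 480, 1920, 960))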

    Vision as inverse graphics for detailed scene understanding

    An image of a scene can be described by the shape, pose and appearance of the objects within it, as well as the illumination, and the camera that captured it. A fundamental goal in computer vision is to recover such descriptions from an image. Such representations can be useful for tasks such as autonomous robotic interaction with an environment, but obtaining them can be very challenging due to the large variability of objects present in natural scenes. A long-standing approach in computer vision is to use generative models of images in order to infer the descriptions that generated the image. These methods are referred to as “vision as inverse graphics” or “inverse graphics”. We propose using this approach to scene understanding by making use of a generative model (GM) in the form of a graphics renderer. Since searching over scene factors to obtain the best match for an image is very inefficient, we make use of convolutional neural networks, which we refer to as recognition models (RM), trained on synthetic data to initialize the search. First we address the effect that occlusions on objects have on the performance of predictive models of images. We propose an inverse graphics approach to predicting the shape, pose, appearance and illumination with a GM which includes an outlier model to account for occlusions. We study how the inferences are affected by the degree of occlusion of the foreground object, and show that a robust GM which includes an outlier model to account for occlusions works significantly better than a non-robust model. We then characterize the performance of the RM and the gains that can be made by refining the search using the robust GM, using a new synthetic dataset that includes background clutter and occlusions. We find that pose and shape are predicted very well by the RM, but appearance and especially illumination less so. However, accuracy on these latter two factors can be clearly improved with the generative model. Next we apply our inverse graphics approach to scenes with multiple objects. We propose a method to efficiently and differentiably model self-shadowing, which improves the realism of the GM renders. We also propose a way to render object occlusion boundaries which results in more accurate gradients of the rendering function. We evaluate these improvements using a dataset with multiple objects and show that the refinement step of the GM clearly improves on the predictions of the RM for the latent variables of shape, pose, appearance and illumination. Finally we tackle the task of learning generative models of 3D objects from a collection of meshes. We present a latent variable architecture that learns to separately capture the underlying factors of shape and appearance from the meshes. To do so we first transform the meshes of a given class to a data representation that sidesteps the need for landmark correspondences across meshes when learning the GM. The ability and usefulness of learning a disentangled latent representation of objects is demonstrated via an experiment where the appearance of one object is transferred onto the shape of another.
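
    The refinement loop described above, where a recognition model initializes the scene parameters and the generative model refines them under a robust, outlier-aware image likelihood, can be sketched as follows. This is not the thesis implementation: render and recognition_model are hypothetical placeholders, the mixture parameters are assumptions, and the finite-difference gradient merely stands in for the renderer's true gradients (Python/NumPy).

    import numpy as np

    def robust_nll(rendered, observed, sigma=0.05, outlier_prob=0.1):
        # Per-pixel mixture of a Gaussian inlier model and a uniform outlier model over [0, 1],
        # so occluded pixels are explained by the outlier component rather than distorting the fit.
        gauss = np.exp(-0.5 * ((rendered - observed) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return -np.log((1.0 - outlier_prob) * gauss + outlier_prob * 1.0).sum()

    def refine(observed, recognition_model, render, steps=200, lr=1e-3, eps=1e-4):
        theta = recognition_model(observed)            # RM initialization of scene parameters
        for _ in range(steps):
            base = robust_nll(render(theta), observed)
            grad = np.zeros_like(theta)
            for i in range(theta.size):                # finite differences, for brevity only
                bumped = theta.copy()
                bumped[i] += eps
                grad[i] = (robust_nll(render(bumped), observed) - base) / eps
            theta = theta - lr * grad                  # gradient descent on the robust objective
        return theta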

    Camera Re-Localization with Data Augmentation by Image Rendering and Image-to-Image Translation

    The self-localization of automobiles, robots, or unmanned aerial vehicles, as well as the self-localization of pedestrians, is and will remain of high interest for a wide range of applications. One main task is the autonomous navigation of such vehicles, in which localization within the surrounding scene is a key component. Since cameras are established, permanently installed sensors in automobiles, robots, and unmanned aerial vehicles, the additional effort of also using them for localization tasks is small or nonexistent. The same holds for the self-localization of pedestrians, where smartphones serve as mobile camera platforms. Camera re-localization, in which the pose of a camera is determined with respect to a fixed environment, is a valuable process for providing or supporting localization for vehicles or pedestrians. Cameras are, moreover, low-cost sensors that are well established in the everyday life of people and machines. The support provided by camera re-localization is not limited to navigation applications, but can generally be used to support image analysis and image processing tasks such as scene reconstruction, detection, classification, or similar applications. To these ends, this work addresses the improvement of the camera re-localization process. Since Convolutional Neural Networks (CNNs) and hybrid solutions for estimating camera poses have become competitive with established hand-crafted methods in recent years, the focus of this thesis is placed on the former. The main contributions of this work include the design of a CNN for estimating camera poses, with an emphasis on a shallow architecture that meets the requirements of mobile platforms. This network achieves accuracies on par with deeper CNNs of substantially larger model size. Furthermore, the performance of CNNs depends strongly on the quantity and quality of the underlying training data used for optimization. The remaining contributions of this thesis therefore address image rendering and image-to-image translation for augmenting such training data; the general augmentation of training data is called data augmentation (DA). For rendering images that usefully augment the training data, 3D models are used, while Generative Adversarial Networks (GANs) serve for image-to-image translation. Whereas image rendering increases the quantity of an image dataset, image-to-image translation improves the quality of the rendered data. Experiments are conducted both with datasets augmented by rendered images and with translated images. Both DA approaches contribute to improving the localization accuracy. Thus, in this work, camera re-localization with state-of-the-art methods is improved through DA.
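
    The pose-regression idea summarized above, a shallow CNN that maps an image to a 3D position and an orientation quaternion, can be sketched as below. The layer sizes, the loss weighting beta, and the quaternion parameterization are assumptions for illustration and are not the thesis architecture (Python/PyTorch).

    import torch
    import torch.nn as nn

    class ShallowPoseNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc_xyz = nn.Linear(128, 3)   # camera position
            self.fc_quat = nn.Linear(128, 4)  # camera orientation as a quaternion

        def forward(self, image):
            f = self.features(image).flatten(1)
            q = self.fc_quat(f)
            return self.fc_xyz(f), q / q.norm(dim=1, keepdim=True)

    def pose_loss(pred_t, pred_q, gt_t, gt_q, beta=100.0):
        # Weighted sum of position and orientation errors, as commonly used for pose regression.
        return (pred_t - gt_t).norm(dim=1).mean() + beta * (pred_q - gt_q).norm(dim=1).mean()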

    Single View 3D Reconstruction using Deep Learning

    One of the major challenges in the field of Computer Vision has been the reconstruction of a 3D object or scene from a single 2D image. While there are many notable examples, traditional methods for single view reconstruction often fail to generalise due to the presence of many brittle hand-crafted engineering solutions, limiting their applicability to real world problems. Recently, deep learning has taken over the field of Computer Vision and "learning to reconstruct" has become the dominant technique for addressing the limitations of traditional methods when performing single view 3D reconstruction. Deep learning allows our reconstruction methods to learn generalisable image features and monocular cues that would otherwise be difficult to engineer through ad-hoc hand-crafted approaches. However, it can often be difficult to efficiently integrate the various 3D shape representations within the deep learning framework. In particular, 3D volumetric representations can be adapted to work with Convolutional Neural Networks, but they are computationally expensive and memory inefficient when using local convolutional layers. Also, the successful learning of generalisable feature representations for 3D reconstruction requires large amounts of diverse training data. In practice, this is challenging for 3D training data, as it entails a costly and time consuming manual data collection and annotation process. Researchers have attempted to address these issues by utilising self-supervised learning and generative modelling techniques; however, these approaches often produce suboptimal results when compared with models trained on larger datasets. This thesis addresses several key challenges incurred when using deep learning for "learning to reconstruct" 3D shapes from single view images. We observe that it is possible to learn a compressed representation for multiple categories of the 3D ShapeNet dataset, improving the computational and memory efficiency when working with 3D volumetric representations. To address the challenge of data acquisition, we leverage deep generative models to "hallucinate" hidden or latent novel viewpoints for a given input image. Combining these images with depths estimated by a self-supervised depth estimator and the known camera properties allowed us to reconstruct textured 3D point clouds without any ground truth 3D training data. Furthermore, we show that it is possible to improve upon the previous self-supervised monocular depth estimator by adding self-attention and a discrete volumetric representation, significantly improving accuracy on the KITTI 2015 dataset and enabling the estimation of uncertainty in depth predictions. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
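
    The step mentioned above of combining hallucinated views, self-supervised depths, and known camera properties into textured point clouds rests on back-projecting each depth map through the camera intrinsics. The following is a minimal sketch under an assumed pinhole model; the function name and arguments are illustrative, not the thesis code (Python/NumPy).

    import numpy as np

    def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
        """depth: (H, W) metric depth map; rgb: (H, W, 3) colours in [0, 1].
        Returns (N, 3) camera-frame points and (N, 3) colours for pixels with positive depth."""
        height, width = depth.shape
        u, v = np.meshgrid(np.arange(width), np.arange(height))
        valid = depth > 0
        z = depth[valid]
        x = (u[valid] - cx) * z / fx
        y = (v[valid] - cy) * z / fy
        points = np.stack([x, y, z], axis=1)
        colours = rgb[valid]
        return points, colours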

    Synthetic image generation and the use of virtual environments for image enhancement tasks

    Deep learning networks are often difficult to train if there are insufficient image samples, and gathering real-world images tailored to a specific task requires considerable effort. This dissertation explores techniques for synthetic image generation and virtual environments for various image enhancement/correction/restoration tasks, specifically distortion correction, dehazing, shadow removal, and intrinsic image decomposition. First, given various image formation equations, such as those used in distortion correction and dehazing, synthetic image samples can be produced, provided that the equation is well-posed. Second, using virtual environments to train various image models is applicable for simulating real-world effects that are otherwise difficult to gather or replicate, such as dehazing and shadow removal. Given synthetic images, one cannot train a network directly on them, as there is a possible gap between the synthetic and real domains. We have devised several techniques for generating synthetic images and formulated domain adaptation methods where our trained deep-learning networks perform competitively in distortion correction, dehazing, and shadow removal. Additional studies and directions are provided for the intrinsic image decomposition problem and the exploration of procedural content generation, where a virtual Philippine city was created as an initial prototype. Keywords: image generation, image correction, image dehazing, shadow removal, intrinsic image decomposition, computer graphics, rendering, machine learning, neural networks, domain adaptation, procedural content generation
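
    As a worked example of the image formation equations mentioned above, haze is commonly synthesized with the atmospheric scattering model I = J * t + A * (1 - t), with transmission t = exp(-beta * d) for scene depth d. The sketch below applies that model; the parameter values and function signature are illustrative assumptions rather than the dissertation's pipeline (Python/NumPy).

    import numpy as np

    def add_synthetic_haze(clear_image, depth, beta=1.0, airlight=0.9):
        """clear_image: (H, W, 3) in [0, 1]; depth: (H, W) scene depth in arbitrary units.
        Returns the hazy image under the atmospheric scattering model."""
        transmission = np.exp(-beta * depth)[..., None]          # t = exp(-beta * d)
        hazy = clear_image * transmission + airlight * (1.0 - transmission)
        return np.clip(hazy, 0.0, 1.0)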