    3D Shape Segmentation with Projective Convolutional Networks

    This paper introduces a deep architecture for segmenting 3D objects into their labeled semantic parts. Our architecture combines image-based Fully Convolutional Networks (FCNs) and surface-based Conditional Random Fields (CRFs) to yield coherent segmentations of 3D shapes. The image-based FCNs are used for efficient view-based reasoning about 3D object parts. Through a special projection layer, FCN outputs are effectively aggregated across multiple views and scales, then projected onto the 3D object surfaces. Finally, a surface-based CRF combines the projected outputs with geometric consistency cues to yield coherent segmentations. The whole architecture (multi-view FCNs and CRF) is trained end-to-end. Our approach significantly outperforms existing state-of-the-art methods on the currently largest segmentation benchmark (ShapeNet). Finally, we demonstrate promising segmentation results on noisy 3D shapes acquired from consumer-grade depth cameras. Comment: This is an updated version of our CVPR 2017 paper. We incorporated new experiments that demonstrate ShapePFCN performance in the case of consistent *upright* orientation and an additional input channel in our rendered images encoding height from the ground plane (upright-axis coordinate values). Performance is improved in this setting.
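
    The core step described above, projecting per-view FCN outputs onto the object surface and aggregating them before the CRF, can be illustrated with a minimal sketch. This is not the authors' code: the array shapes, the per-pixel face-index maps, and the max-pooling aggregation rule are assumptions made for illustration (Python/NumPy).

    import numpy as np

    def aggregate_view_scores(view_scores, face_index_maps, num_faces):
        """view_scores: list of (H, W, K) per-pixel part probabilities, one per rendered view.
        face_index_maps: list of (H, W) integer arrays mapping each pixel to the visible mesh
        face id (-1 where no face is visible). Returns (num_faces, K) aggregated face scores."""
        num_parts = view_scores[0].shape[-1]
        face_scores = np.zeros((num_faces, num_parts))
        for scores, face_ids in zip(view_scores, face_index_maps):
            visible = face_ids >= 0
            ids = face_ids[visible]               # face id for each visible pixel
            pix = scores[visible]                 # (N, K) scores for those pixels
            np.maximum.at(face_scores, ids, pix)  # max-pool per face across pixels and views
        return face_scores

    def unary_labels(face_scores):
        # Per-face argmax; the paper additionally smooths these unaries with a surface-based CRF.
        return face_scores.argmax(axis=1)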

    Cross-dimensional Analysis for Improved Scene Understanding

    Visual data have taken up an increasingly large role in our society. Most people have instant access to a high-quality camera in their pockets, and we are taking more pictures than ever before. Meanwhile, through the advent of better software and hardware, the prevalence of 3D data is also rapidly expanding, and demand for data and analysis methods is burgeoning in a wide range of industries. The amount of information about the world implicitly contained in this stream of data is staggering. However, as these images and models are created in uncontrolled circumstances, the extraction of any structured information from the unstructured pixels and vertices is highly non-trivial. To aid this process, we note that the 2D and 3D data modalities are similar in content but intrinsically different in form. Exploiting their complementary nature, we can investigate certain problems in a cross-dimensional fashion: for example, where 2D lacks expressiveness, 3D can supplement it; where 3D lacks quality, 2D can provide it. In this thesis, we explore three analysis tasks with this insight as our point of departure. First, we show that by considering the tasks of 2D and 3D retrieval jointly we can improve the performance of 3D retrieval while simultaneously enabling interesting new ways of exploring 2D retrieval results. Second, we discuss a compact representation of indoor scenes called a "scene map", which represents the objects in a scene using a top-down map of object locations. We propose a method for automatically extracting such scene maps from single 2D images using a database of 3D models for training. Finally, we seek to convert single 2D images to full 3D scenes using a database of 3D models as input. Occlusion is handled by modelling object context explicitly, allowing us to identify and pose objects that would otherwise be too occluded to make inferences about. For all three tasks, we show the utility of our cross-dimensional insight by evaluating each method extensively and showing favourable performance over baseline methods.
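
    As a concrete illustration of the "scene map" representation mentioned above, the following is a minimal sketch of a top-down record of object locations. The field names, units, and the dataclass layout are assumptions for illustration, not the thesis format (Python).

    from dataclasses import dataclass, field

    @dataclass
    class SceneObject:
        category: str      # e.g. "chair", "table"
        x: float           # position on the floor plane, metres (assumed units)
        z: float
        yaw: float = 0.0   # orientation about the vertical axis, radians

    @dataclass
    class SceneMap:
        width: float       # footprint extent of the room, metres
        depth: float
        objects: list = field(default_factory=list)

        def add(self, obj: SceneObject):
            self.objects.append(obj)

        def by_category(self, category: str):
            return [o for o in self.objects if o.category == category]

    # Example: a small scene with two objects.
    scene = SceneMap(width=4.0, depth=5.0)
    scene.add(SceneObject("table", x=1.2, z=2.5))
    scene.add(SceneObject("chair", x=1.8, z=2.2, yaw=1.57))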

    Deep Learning-Based Human Pose Estimation: A Survey

    Human pose estimation aims to locate the human body parts and build a human body representation (e.g., a body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although the recently developed deep learning-based solutions have achieved high performance in human pose estimation, challenges remain due to insufficient training data, depth ambiguities, and occlusion. The goal of this survey paper is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers since 2014 are covered in this survey. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the remaining challenges, applications, and future research directions are discussed. We also provide a regularly updated project page: \url{https://github.com/zczcwh/DL-HPE}.
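
    Many of the 2D methods covered by such surveys predict one confidence heatmap per body joint and decode keypoints from the heatmap maxima. The following is a minimal sketch of that decoding step under assumed shapes and an assumed confidence threshold; it is illustrative and not tied to any specific surveyed method (Python/NumPy).

    import numpy as np

    def decode_keypoints(heatmaps, threshold=0.1):
        """heatmaps: (J, H, W) array, one confidence map per body joint.
        Returns a (J, 3) array of (x, y, confidence); joints below the threshold are zeroed."""
        num_joints, height, width = heatmaps.shape
        keypoints = np.zeros((num_joints, 3))
        for j in range(num_joints):
            y, x = divmod(heatmaps[j].argmax(), width)   # location of the heatmap maximum
            conf = heatmaps[j, y, x]
            if conf >= threshold:
                keypoints[j] = (x, y, conf)
        return keypoints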

    3D Scene Geometry Estimation from 360° Imagery: A Survey

    This paper provides a comprehensive survey on pioneer and state-of-the-art 3D scene geometry estimation methodologies based on single, two, or multiple images captured with omnidirectional optics. We first revisit the basic concepts of the spherical camera model and review the most common acquisition technologies and representation formats suitable for omnidirectional (also called 360°, spherical, or panoramic) images and videos. We then survey monocular layout and depth inference approaches, highlighting the recent advances in learning-based solutions suited for spherical data. Classical stereo matching is then revisited in the spherical domain, where methodologies for detecting and describing sparse and dense features become crucial. The stereo matching concepts are then extrapolated to multiple-view camera setups, categorizing them among light fields, multi-view stereo, and structure from motion (or visual simultaneous localization and mapping). We also compile and discuss commonly adopted datasets and figures of merit indicated for each purpose and list recent results for completeness. We conclude this paper by pointing out current and future trends. Comment: Published in ACM Computing Surveys.
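
    The spherical camera model revisited above admits a compact worked example: mapping a pixel of an equirectangular 360° image to a unit ray direction on the sphere. Axis conventions and the image-origin convention differ between papers; the ones used here are assumptions (Python/NumPy).

    import numpy as np

    def equirect_pixel_to_ray(u, v, width, height):
        """u, v: pixel coordinates in an equirectangular image of size (width, height).
        Returns a unit 3D direction (x, y, z) on the viewing sphere."""
        lon = (u / width) * 2.0 * np.pi - np.pi      # longitude in [-pi, pi)
        lat = np.pi / 2.0 - (v / height) * np.pi     # latitude in [-pi/2, pi/2]
        x = np.cos(lat) * np.sin(lon)
        y = np.sin(lat)
        z = np.cos(lat) * np.cos(lon)
        return np.array([x, y, z])

    # Example: the image centre maps to the forward direction (0, 0, 1) under these conventions.
    print(equirect_pixel_to_ray(960, 480, 1920, 960))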

    Vision as inverse graphics for detailed scene understanding

    An image of a scene can be described by the shape, pose and appearance of the objects within it, as well as the illumination, and the camera that captured it. A fundamental goal in computer vision is to recover such descriptions from an image. Such representations can be useful for tasks such as autonomous robotic interaction with an environment, but obtaining them can be very challenging due to the large variability of objects present in natural scenes. A long-standing approach in computer vision is to use generative models of images in order to infer the descriptions that generated the image. These methods are referred to as “vision as inverse graphics” or “inverse graphics”. We propose using this approach to scene understanding by making use of a generative model (GM) in the form of a graphics renderer. Since searching over scene factors to obtain the best match for an image is very inefficient, we make use of convolutional neural networks, which we refer to as recognition models (RM), trained on synthetic data to initialize the search. First we address the effect that occlusions on objects have on the performance of predictive models of images. We propose an inverse graphics approach to predicting the shape, pose, appearance and illumination with a GM which includes an outlier model to account for occlusions. We study how the inferences are affected by the degree of occlusion of the foreground object, and show that a robust GM which includes an outlier model to account for occlusions works significantly better than a non-robust model. We then characterize the performance of the RM and the gains that can be made by refining the search using the robust GM, using a new synthetic dataset that includes background clutter and occlusions. We find that pose and shape are predicted very well by the RM, but appearance and especially illumination less so. However, accuracy on these latter two factors can be clearly improved with the generative model. Next we apply our inverse graphics approach to scenes with multiple objects. We propose a method to efficiently and differentiably model self-shadowing, which improves the realism of the GM renders. We also propose a way to render object occlusion boundaries which results in more accurate gradients of the rendering function. We evaluate these improvements using a dataset with multiple objects and show that the refinement step of the GM clearly improves on the predictions of the RM for the latent variables of shape, pose, appearance and illumination. Finally we tackle the task of learning generative models of 3D objects from a collection of meshes. We present a latent variable architecture that learns to separately capture the underlying factors of shape and appearance from the meshes. To do so we first transform the meshes of a given class to a data representation that sidesteps the need for landmark correspondences across meshes when learning the GM. The ability and usefulness of learning a disentangled latent representation of objects is demonstrated via an experiment where the appearance of one object is transferred onto the shape of another.
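
    The refinement loop described above, where a recognition model initializes the scene parameters and the generative model refines them under a robust, outlier-aware image likelihood, can be sketched as follows. This is not the thesis implementation: render and recognition_model are hypothetical placeholders, the mixture parameters are assumptions, and the finite-difference gradient merely stands in for the renderer's true gradients (Python/NumPy).

    import numpy as np

    def robust_nll(rendered, observed, sigma=0.05, outlier_prob=0.1):
        # Per-pixel mixture of a Gaussian inlier model and a uniform outlier model over [0, 1],
        # so occluded pixels are explained by the outlier component rather than distorting the fit.
        gauss = np.exp(-0.5 * ((rendered - observed) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return -np.log((1.0 - outlier_prob) * gauss + outlier_prob * 1.0).sum()

    def refine(observed, recognition_model, render, steps=200, lr=1e-3, eps=1e-4):
        theta = recognition_model(observed)            # RM initialization of scene parameters
        for _ in range(steps):
            base = robust_nll(render(theta), observed)
            grad = np.zeros_like(theta)
            for i in range(theta.size):                # finite differences, for brevity only
                bumped = theta.copy()
                bumped[i] += eps
                grad[i] = (robust_nll(render(bumped), observed) - base) / eps
            theta = theta - lr * grad                  # gradient descent on the robust objective
        return theta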

    Camera Re-Localization with Data Augmentation by Image Rendering and Image-to-Image Translation

    The self-localization of automobiles, robots, or unmanned aerial vehicles, as well as the self-localization of pedestrians, is and will remain of high interest for a wide range of applications. One main task is the autonomous navigation of such vehicles, in which localization within the surrounding scene is a key component. Since cameras are established, permanently installed sensors in automobiles, robots, and unmanned aerial vehicles, the additional effort of also using them for localization tasks is small or nonexistent. The same holds for the self-localization of pedestrians, where smartphones serve as mobile camera platforms. Camera re-localization, in which the pose of a camera is determined with respect to a fixed environment, is a valuable process for providing or supporting localization for vehicles or pedestrians. Cameras are, moreover, low-cost sensors that are well established in the everyday life of people and machines. The support provided by camera re-localization is not limited to navigation applications, but can generally be used to support image analysis and image processing tasks such as scene reconstruction, detection, classification, or similar applications. To these ends, this work addresses the improvement of the camera re-localization process. Since Convolutional Neural Networks (CNNs) and hybrid solutions for estimating camera poses have become competitive with established hand-crafted methods in recent years, the focus of this thesis is placed on the former. The main contributions of this work include the design of a CNN for estimating camera poses, with an emphasis on a shallow architecture that meets the requirements of mobile platforms. This network achieves accuracies on par with deeper CNNs of substantially larger model size. Furthermore, the performance of CNNs depends strongly on the quantity and quality of the underlying training data used for optimization. The remaining contributions of this thesis therefore address image rendering and image-to-image translation for augmenting such training data; the general augmentation of training data is called data augmentation (DA). For rendering images that usefully augment the training data, 3D models are used, while Generative Adversarial Networks (GANs) serve for image-to-image translation. Whereas image rendering increases the quantity of an image dataset, image-to-image translation improves the quality of the rendered data. Experiments are conducted both with datasets augmented by rendered images and with translated images. Both DA approaches contribute to improving the localization accuracy. Thus, in this work, camera re-localization with state-of-the-art methods is improved through DA.
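
    The pose-regression idea summarized above, a shallow CNN that maps an image to a 3D position and an orientation quaternion, can be sketched as below. The layer sizes, the loss weighting beta, and the quaternion parameterization are assumptions for illustration and are not the thesis architecture (Python/PyTorch).

    import torch
    import torch.nn as nn

    class ShallowPoseNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc_xyz = nn.Linear(128, 3)   # camera position
            self.fc_quat = nn.Linear(128, 4)  # camera orientation as a quaternion

        def forward(self, image):
            f = self.features(image).flatten(1)
            q = self.fc_quat(f)
            return self.fc_xyz(f), q / q.norm(dim=1, keepdim=True)

    def pose_loss(pred_t, pred_q, gt_t, gt_q, beta=100.0):
        # Weighted sum of position and orientation errors, as commonly used for pose regression.
        return (pred_t - gt_t).norm(dim=1).mean() + beta * (pred_q - gt_q).norm(dim=1).mean()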

    Single View 3D Reconstruction using Deep Learning

    One of the major challenges in the field of Computer Vision has been the reconstruction of a 3D object or scene from a single 2D image. While there are many notable examples, traditional methods for single view reconstruction often fail to generalise due to the presence of many brittle hand-crafted engineering solutions, limiting their applicability to real world problems. Recently, deep learning has taken over the field of Computer Vision and "learning to reconstruct" has become the dominant technique for addressing the limitations of traditional methods when performing single view 3D reconstruction. Deep learning allows our reconstruction methods to learn generalisable image features and monocular cues that would otherwise be difficult to engineer through ad-hoc hand-crafted approaches. However, it can often be difficult to efficiently integrate the various 3D shape representations within the deep learning framework. In particular, 3D volumetric representations can be adapted to work with Convolutional Neural Networks, but they are computationally expensive and memory inefficient when using local convolutional layers. Also, the successful learning of generalisable feature representations for 3D reconstruction requires large amounts of diverse training data. In practice, this is challenging for 3D training data, as it entails a costly and time consuming manual data collection and annotation process. Researchers have attempted to address these issues by utilising self-supervised learning and generative modelling techniques; however, these approaches often produce suboptimal results when compared with models trained on larger datasets. This thesis addresses several key challenges incurred when using deep learning for "learning to reconstruct" 3D shapes from single view images. We observe that it is possible to learn a compressed representation for multiple categories of the 3D ShapeNet dataset, improving the computational and memory efficiency when working with 3D volumetric representations. To address the challenge of data acquisition, we leverage deep generative models to "hallucinate" hidden or latent novel viewpoints for a given input image. Combining these images with depths estimated by a self-supervised depth estimator and the known camera properties allowed us to reconstruct textured 3D point clouds without any ground truth 3D training data. Furthermore, we show that it is possible to improve upon the previous self-supervised monocular depth estimator by adding self-attention and a discrete volumetric representation, significantly improving accuracy on the KITTI 2015 dataset and enabling the estimation of uncertainty in depth predictions. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
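
    The step mentioned above of combining hallucinated views, self-supervised depths, and known camera properties into textured point clouds rests on back-projecting each depth map through the camera intrinsics. The following is a minimal sketch under an assumed pinhole model; the function name and arguments are illustrative, not the thesis code (Python/NumPy).

    import numpy as np

    def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
        """depth: (H, W) metric depth map; rgb: (H, W, 3) colours in [0, 1].
        Returns (N, 3) camera-frame points and (N, 3) colours for pixels with positive depth."""
        height, width = depth.shape
        u, v = np.meshgrid(np.arange(width), np.arange(height))
        valid = depth > 0
        z = depth[valid]
        x = (u[valid] - cx) * z / fx
        y = (v[valid] - cy) * z / fy
        points = np.stack([x, y, z], axis=1)
        colours = rgb[valid]
        return points, colours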

    Synthetic image generation and the use of virtual environments for image enhancement tasks

    Deep learning networks are often difficult to train if there are insufficient image samples, and gathering real-world images tailored to a specific task requires considerable effort. This dissertation explores techniques for synthetic image generation and virtual environments for various image enhancement/correction/restoration tasks, specifically distortion correction, dehazing, shadow removal, and intrinsic image decomposition. First, given various image formation equations, such as those used in distortion correction and dehazing, synthetic image samples can be produced, provided that the equation is well-posed. Second, using virtual environments to train various image models is applicable for simulating real-world effects that are otherwise difficult to gather or replicate, such as dehazing and shadow removal. Given synthetic images, one cannot train a network directly on them, as there is a possible gap between the synthetic and real domains. We have devised several techniques for generating synthetic images and formulated domain adaptation methods where our trained deep-learning networks perform competitively in distortion correction, dehazing, and shadow removal. Additional studies and directions are provided for the intrinsic image decomposition problem and the exploration of procedural content generation, where a virtual Philippine city was created as an initial prototype. Keywords: image generation, image correction, image dehazing, shadow removal, intrinsic image decomposition, computer graphics, rendering, machine learning, neural networks, domain adaptation, procedural content generation
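
    As a worked example of the image formation equations mentioned above, haze is commonly synthesized with the atmospheric scattering model I = J * t + A * (1 - t), with transmission t = exp(-beta * d) for scene depth d. The sketch below applies that model; the parameter values and function signature are illustrative assumptions rather than the dissertation's pipeline (Python/NumPy).

    import numpy as np

    def add_synthetic_haze(clear_image, depth, beta=1.0, airlight=0.9):
        """clear_image: (H, W, 3) in [0, 1]; depth: (H, W) scene depth in arbitrary units.
        Returns the hazy image under the atmospheric scattering model."""
        transmission = np.exp(-beta * depth)[..., None]          # t = exp(-beta * d)
        hazy = clear_image * transmission + airlight * (1.0 - transmission)
        return np.clip(hazy, 0.0, 1.0)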