3D Shape Segmentation with Projective Convolutional Networks
This paper introduces a deep architecture for segmenting 3D objects into
their labeled semantic parts. Our architecture combines image-based Fully
Convolutional Networks (FCNs) and surface-based Conditional Random Fields
(CRFs) to yield coherent segmentations of 3D shapes. The image-based FCNs are
used for efficient view-based reasoning about 3D object parts. Through a
special projection layer, FCN outputs are effectively aggregated across
multiple views and scales, and then projected onto the 3D object surfaces.
Finally, a surface-based CRF combines the projected outputs with geometric
consistency cues to yield coherent segmentations. The whole architecture
(multi-view FCNs and CRF) is trained end-to-end. Our approach significantly
outperforms existing state-of-the-art methods on the currently largest
segmentation benchmark (ShapeNet). Finally, we demonstrate promising
segmentation results on noisy 3D shapes acquired from consumer-grade depth
cameras.

Comment: This is an updated version of our CVPR 2017 paper. We incorporated
new experiments that demonstrate ShapePFCN performance under the case of
consistent *upright* orientation and an additional input channel in our
rendered images for encoding height from the ground plane (upright axis
coordinate values). Performance is improved in this setting.
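The projection-and-aggregation step can be pictured as scattering each view's per-pixel label scores onto the mesh surface and pooling across views. Below is a minimal numpy sketch of that idea only, assuming a precomputed pixel-to-face visibility map; the array names and the max-pooling choice are our assumptions, and the paper's projection layer additionally handles multiple scales and end-to-end training, omitted here:

    import numpy as np

    def aggregate_views(score_maps, pixel_to_face, num_faces):
        """Pool per-view FCN label scores onto mesh faces.

        score_maps:    (V, H, W, L) per-view, per-pixel label confidences
        pixel_to_face: (V, H, W) id of the face visible at each pixel, -1 = none
        Returns (num_faces, L) per-face scores, max-pooled across views.
        """
        V, H, W, L = score_maps.shape
        face_scores = np.zeros((num_faces, L))
        for v in range(V):
            faces = pixel_to_face[v].ravel()        # (H*W,)
            scores = score_maps[v].reshape(-1, L)   # (H*W, L)
            valid = faces >= 0                      # drop background pixels
            np.maximum.at(face_scores, faces[valid], scores[valid])
        return face_scores

A surface-based CRF, as in the paper, would then smooth the per-face argmax labels using geometric consistency cues.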
Cross-dimensional Analysis for Improved Scene Understanding
Visual data have taken up an increasingly large role in our society. Most people have instant access to a high quality camera in their pockets, and we are taking more pictures than ever before. Meanwhile, through the advent of better software and hardware, the prevalence of 3D data is also rapidly expanding, and demand for data and analysis methods is burgeoning in a wide range of industries. The amount of information about the world implicitly contained in this stream of data is staggering. However, as these images and models are created in uncontrolled circumstances, the extraction of any structured information from the unstructured pixels and vertices is highly non-trivial. To aid this process, we note that the 2D and 3D data modalities are similar in content, but intrinsically different in form. Exploiting their complementary nature, we can investigate certain problems in a cross-dimensional fashion: for example, where 2D lacks expressiveness, 3D can supplement it; where 3D lacks quality, 2D can provide it.

In this thesis, we explore three analysis tasks with this insight as our point of departure. First, we show that by considering the tasks of 2D and 3D retrieval jointly we can improve the performance of 3D retrieval while simultaneously enabling interesting new ways of exploring 2D retrieval results. Second, we discuss a compact representation of indoor scenes called a "scene map", which represents the objects in a scene using a top-down map of object locations. We propose a method for automatically extracting such scene maps from single 2D images, using a database of 3D models for training. Finally, we seek to convert single 2D images to full 3D scenes, using a database of 3D models as input. Occlusion is handled by modelling object context explicitly, allowing us to identify and pose objects that would otherwise be too occluded to make inferences about. For all three tasks, we show the utility of our cross-dimensional insight by evaluating each method extensively and showing favourable performance over baseline methods.
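As a toy illustration of the "scene map" representation described above (all names, cell sizes, and category ids here are hypothetical, not the thesis's), a top-down map of object locations can be stored as a coarse 2D grid over the floor plan:

    import numpy as np

    GRID, CELL = 16, 0.5   # 16x16 cells, 0.5 m per cell (assumed values)

    def build_scene_map(objects):
        """objects: list of (category_id, x, z) floor positions in metres."""
        scene_map = np.full((GRID, GRID), -1, dtype=int)   # -1 = empty cell
        for cat, x, z in objects:
            i, j = int(z / CELL), int(x / CELL)
            if 0 <= i < GRID and 0 <= j < GRID:
                scene_map[i, j] = cat
        return scene_map

    # e.g. a chair at (1.2 m, 0.7 m) and a table at (4.9 m, 3.3 m)
    scene_map = build_scene_map([(3, 1.2, 0.7), (7, 4.9, 3.3)])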
Deep Learning-Based Human Pose Estimation: A Survey
Human pose estimation aims to locate the human body parts and build human
body representation (e.g., body skeleton) from input data such as images and
videos. It has drawn increasing attention during the past decade and has been
utilized in a wide range of applications including human-computer interaction,
motion analysis, augmented reality, and virtual reality. Although the recently
developed deep learning-based solutions have achieved high performance in human
pose estimation, there still remain challenges due to insufficient training
data, depth ambiguities, and occlusion. The goal of this survey paper is to
provide a comprehensive review of recent deep learning-based solutions for both
2D and 3D pose estimation via a systematic analysis and comparison of these
solutions based on their input data and inference procedures. More than 240
research papers since 2014 are covered in this survey. Furthermore, 2D and 3D
human pose estimation datasets and evaluation metrics are included.
Quantitative performance comparisons of the reviewed methods on popular
datasets are summarized and discussed. Finally, we conclude with the remaining
challenges, applications, and future research directions. We also provide a
regularly updated project page: https://github.com/zczcwh/DL-HPE.
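As a concrete example of the evaluation metrics such surveys cover, PCK (Percentage of Correct Keypoints) counts a predicted 2D keypoint as correct if it lies within a fraction of a per-image reference length (head size for PCKh) of the ground truth. A minimal numpy sketch, with array shapes of our own choosing:

    import numpy as np

    def pck(pred, gt, ref_lengths, alpha=0.5):
        """pred, gt: (N, K, 2) keypoints for N images with K joints each;
        ref_lengths: (N,) reference scale per image (e.g. head segment
        length for PCKh@0.5). Returns the fraction of correct keypoints."""
        dist = np.linalg.norm(pred - gt, axis=-1)       # (N, K) pixel errors
        correct = dist <= alpha * ref_lengths[:, None]  # threshold per image
        return correct.mean()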
3D Scene Geometry Estimation from 360 Imagery: A Survey
This paper provides a comprehensive survey on pioneer and state-of-the-art 3D
scene geometry estimation methodologies based on single, two, or multiple
images captured with omnidirectional optics. We first revisit the basic
concepts of the spherical camera model, and review the most common acquisition
technologies and representation formats suitable for omnidirectional (also
called 360, spherical or panoramic) images and videos. We then survey
monocular layout and depth inference approaches, highlighting the recent
advances in learning-based solutions suited for spherical data. The classical
stereo matching is then revisited in the spherical domain, where methodologies
for detecting and describing sparse and dense features become crucial. The
stereo matching concepts are then extrapolated for multiple view camera setups,
categorizing them among light fields, multi-view stereo, and structure from
motion (or visual simultaneous localization and mapping). We also compile and
discuss commonly adopted datasets and figures of merit indicated for each
purpose and list recent results for completeness. We conclude this paper by
pointing out current and future trends.

Comment: Published in ACM Computing Surveys.
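To make the spherical camera model concrete, an equirectangular image indexes the sphere by longitude and latitude, so each pixel maps to a unit viewing ray. A minimal sketch under one common convention (conventions differ across papers; y-up and longitude in [-pi, pi) are assumptions here):

    import numpy as np

    def equirect_to_ray(u, v, width, height):
        """Map equirectangular pixel (u, v) to a unit ray on the sphere."""
        lon = (u + 0.5) / width * 2.0 * np.pi - np.pi    # longitude
        lat = np.pi / 2.0 - (v + 0.5) / height * np.pi   # latitude
        return np.array([np.cos(lat) * np.sin(lon),      # x
                         np.sin(lat),                    # y (up)
                         np.cos(lat) * np.cos(lon)])     # z

    ray = equirect_to_ray(1024, 512, 2048, 1024)   # near the image centre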
Vision as inverse graphics for detailed scene understanding
An image of a scene can be described by the shape, pose and appearance of the objects
within it, as well as the illumination, and the camera that captured it. A fundamental
goal in computer vision is to recover such descriptions from an image. Such representations
can be useful for tasks such as autonomous robotic interaction with an
environment, but obtaining them can be very challenging due to the large variability of
objects present in natural scenes.
A long-standing approach in computer vision is to use generative models of images in
order to infer the descriptions that generated the image. These methods are referred to
as "vision as inverse graphics" or "inverse graphics". We propose using this approach
to scene understanding by making use of a generative model (GM) in the form of
a graphics renderer. Since searching over scene factors to obtain the best match for
an image is very inefficient, we make use of convolutional neural networks, which
we refer to as the recognition models (RM), trained on synthetic data to initialize the
search.
First, we address the effect that occlusions on objects have on the performance of predictive
models of images. We propose an inverse graphics approach to predicting the
shape, pose, appearance and illumination with a GM which includes an outlier model
to account for occlusions. We study how the inferences are affected by the degree
of occlusion of the foreground object, and show that a robust GM which includes
an outlier model to account for occlusions works significantly better than a non-robust
model. We then characterize the performance of the RM and the gains that can be made
by refining the search using the robust GM, using a new synthetic dataset that includes
background clutter and occlusions. We find that pose and shape are predicted very well
by the RM, but appearance and especially illumination less so. However, accuracy on
these latter two factors can be clearly improved with the generative model.
Next, we apply our inverse graphics approach to scenes with multiple objects. We
propose a method to efficiently and differentiably model self-shadowing, which
improves the realism of the GM renders. We also propose a way to render object occlusion
boundaries which results in more accurate gradients of the rendering function.
We evaluate these improvements using a dataset with multiple objects and show that
the refinement step of the GM clearly improves on the predictions of the RM for the
latent variables of shape, pose, appearance and illumination.
Finally, we tackle the task of learning generative models of 3D objects from a collection
of meshes. We present a latent variable architecture that learns to separately capture
the underlying factors of shape and appearance from the meshes. To do so we first
transform the meshes of a given class to a data representation that sidesteps the need
for landmark correspondences across meshes when learning the GM. The ability and
usefulness of learning a disentangled latent representation of objects is demonstrated
via an experiment where the appearance of one object is transferred onto the shape of
another.
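The robust GM idea can be sketched compactly: occluded pixels are explained by an outlier component mixed into the per-pixel likelihood, so they stop dragging the scene-factor search toward bad fits. The following is a minimal illustration under assumed constants; render() and recognition_model() stand in for the thesis's graphics renderer and trained RM and are hypothetical names:

    import numpy as np

    def robust_nll(rendered, observed, sigma=0.05, p_out=0.1):
        """Negative log-likelihood with a Gaussian inlier model mixed
        with a uniform outlier model over normalized intensities [0, 1]."""
        inlier = np.exp(-0.5 * ((rendered - observed) / sigma) ** 2) \
                 / (sigma * np.sqrt(2.0 * np.pi))
        outlier = 1.0                      # uniform density on [0, 1]
        return -np.log((1.0 - p_out) * inlier + p_out * outlier).sum()

    # Sketch of the search: the RM proposes initial scene factors, then
    # the GM refines them, e.g. by local search over the latents.
    # theta = recognition_model(image)               # RM initialization
    # for _ in range(100):                           # GM refinement
    #     cand = theta + 0.01 * np.random.randn(*theta.shape)
    #     if robust_nll(render(cand), image) < robust_nll(render(theta), image):
    #         theta = cand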
Leveraging Depth for 3D Scene Perception
3D scene perception aims to understand the geometric and semantic information of the surrounding environment. It is crucial in many downstream applications, such as autonomous driving, robotics, AR/VR, and human-computer interaction. Despite its significance, understanding the 3D scene has been a challenging task due to the complex interactions between objects, heavy occlusions, cluttered indoor environments, and major appearance, viewpoint, and scale changes. The study of 3D scene perception has been significantly reshaped by powerful deep learning models, which are capable of leveraging large-scale training data to achieve outstanding performance. Learning-based models unlock new challenges and opportunities in the field.

In this dissertation, we first present learning-based approaches to estimating depth maps, a crucial input to many 3D scene perception models. We describe two overlooked challenges in learning monocular depth estimators and present our proposed solutions. Specifically, we address the high-level domain gap between real and synthetic training data and the shift in camera pose distribution between training and testing data. Following that, we present two application-driven works that leverage depth maps to achieve better 3D scene perception. We explore in detail the tasks of reference-based image inpainting and 3D object instance tracking in scenes from egocentric videos.
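For concreteness, the basic operation that makes depth maps so useful for 3D scene perception is back-projection: lifting each pixel into a camera-frame 3D point with the pinhole model. A minimal numpy sketch (the intrinsics values below are placeholders, not ones from the dissertation):

    import numpy as np

    def depth_to_points(depth, fx, fy, cx, cy):
        """Back-project a metric depth map (H, W) to (H*W, 3) points
        using x = (u - cx) * z / fx and y = (v - cy) * z / fy."""
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
        z = depth
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.stack([x, y, z], axis=-1).reshape(-1, 3)

    pts = depth_to_points(np.ones((480, 640)), 525.0, 525.0, 319.5, 239.5)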
Camera Re-Localization with Data Augmentation by Image Rendering and Image-to-Image Translation
The self-localization of automobiles, robots, and unmanned aerial vehicles, as well as the self-localization of pedestrians, is and will remain of great interest for a wide variety of applications. One main task is the autonomous navigation of such vehicles, in which localization within the surrounding scene is a key component. Since cameras are established, permanently installed sensors in automobiles, robots, and unmanned aerial vehicles, the additional effort of using them for localization tasks as well is small to non-existent. The same holds for the self-localization of pedestrians, where smartphones serve as mobile camera platforms. Camera re-localization, in which the pose of a camera is determined with respect to a fixed environment, is a valuable process for providing or supporting localization for vehicles and pedestrians. Cameras are, moreover, inexpensive sensors that are well established in the everyday lives of humans and machines. The support offered by camera re-localization is not limited to navigation applications; it can also be used to support image analysis and image processing in general, such as scene reconstruction, detection, classification, and similar applications.

To these ends, this work addresses improving the camera re-localization process. Since Convolutional Neural Networks (CNNs) and hybrid solutions for estimating camera poses have come to compete with established hand-crafted methods in recent years, this thesis focuses on the former. The main contributions of this work include the design of a CNN for camera pose estimation, with an emphasis on a shallow architecture that meets the requirements of mobile platforms. This network achieves accuracies on par with deeper CNNs of substantially larger model size. Furthermore, the performance of CNNs depends strongly on the quantity and quality of the training data used for optimization. The remaining contributions of this thesis therefore concern image rendering and image-to-image translation for augmenting such training data; augmenting training data in this general sense is called data augmentation (DA). 3D models are used to render images that usefully extend the training data, and Generative Adversarial Networks (GANs) perform the image-to-image translation. While image rendering increases the quantity of an image dataset, image-to-image translation improves the quality of the rendered data. Experiments are conducted with datasets augmented both with rendered images and with translated images. Both DA approaches contribute to improved localization accuracy. This work thus improves camera re-localization with state-of-the-art methods through DA.
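A common loss for CNN-based camera pose regression (a PoseNet-style formulation; the thesis's exact loss, weighting, and rotation parameterization are not specified here, so the choices below are assumptions) combines a position error with a weighted quaternion orientation error:

    import numpy as np

    def pose_loss(t_pred, q_pred, t_gt, q_gt, beta=250.0):
        """Position error plus beta-weighted orientation error; q_* are
        quaternions, t_* are translations. beta balances metres against
        quaternion distance and is dataset-dependent."""
        q_pred = q_pred / np.linalg.norm(q_pred)   # normalize prediction
        pos_err = np.linalg.norm(t_gt - t_pred)
        rot_err = np.linalg.norm(q_gt - q_pred)
        return pos_err + beta * rot_err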
Single View 3D Reconstruction using Deep Learning
One of the major challenges in the field of Computer Vision has been the reconstruction of a 3D object or scene from a single 2D image. While there are many notable examples, traditional methods for single view reconstruction often fail to generalise due to the presence of many brittle hand-crafted engineering solutions, limiting their applicability to real world problems. Recently, deep learning has taken over the field of Computer Vision and "learning to reconstruct" has become the dominant technique for addressing the limitations of traditional methods when performing single view 3D reconstruction. Deep learning allows our reconstruction methods to learn generalisable image features and monocular cues that would otherwise be difficult to engineer through ad-hoc hand-crafted approaches. However, it can often be difficult to efficiently integrate the various 3D shape representations within the deep learning framework. In particular, 3D volumetric representations can be adapted to work with Convolutional Neural Networks, but they are computationally expensive and memory inefficient when using local convolutional layers. Also, the successful learning of generalisable feature representations for 3D reconstruction requires large amounts of diverse training data. In practice, this is challenging for 3D training data, as it entails a costly and time consuming manual data collection and annotation process. Researchers have attempted to address these issues by utilising self-supervised learning and generative modelling techniques; however, these approaches often produce suboptimal results when compared with models trained on larger datasets.

This thesis addresses several key challenges incurred when using deep learning for "learning to reconstruct" 3D shapes from single view images. We observe that it is possible to learn a compressed representation for multiple categories of the 3D ShapeNet dataset, improving the computational and memory efficiency when working with 3D volumetric representations. To address the challenge of data acquisition, we leverage deep generative models to "hallucinate" hidden or latent novel viewpoints for a given input image. Combining these images with depths estimated by a self-supervised depth estimator and the known camera properties allows us to reconstruct textured 3D point clouds without any ground truth 3D training data. Furthermore, we show that it is possible to improve upon the previous self-supervised monocular depth estimator by adding self-attention and a discrete volumetric representation, significantly improving accuracy on the KITTI 2015 dataset and enabling the estimation of uncertainty in depth predictions.

Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
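The memory cost of volumetric representations noted above is easy to see in code: a dense binary occupancy grid grows cubically with resolution (32^3 is 32,768 cells, but 256^3 is already about 16.8 million). A minimal sketch of voxelizing a point cloud, with the grid bounds as assumptions:

    import numpy as np

    def voxelize(points, grid=32, lo=-1.0, hi=1.0):
        """Quantize (N, 3) points inside [lo, hi]^3 into a binary
        occupancy grid of shape (grid, grid, grid)."""
        vox = np.zeros((grid, grid, grid), dtype=bool)
        idx = ((points - lo) / (hi - lo) * grid).astype(int)
        idx = idx[((idx >= 0) & (idx < grid)).all(axis=1)]  # clip outliers
        vox[idx[:, 0], idx[:, 1], idx[:, 2]] = True
        return vox

    vox = voxelize(np.random.uniform(-1, 1, (1000, 3)))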
Synthetic image generation and the use of virtual environments for image enhancement tasks
Deep learning networks are often difficult to train if there are insufficient image samples. Gathering real-world images tailored to a specific task takes considerable effort. This dissertation explores techniques for synthetic image generation and virtual environments for various image enhancement/correction/restoration tasks, specifically distortion correction, dehazing, shadow removal, and intrinsic image decomposition. First, given various image formation equations, such as those used in distortion correction and dehazing, synthetic image samples can be produced, provided that the equation is well-posed. Second, virtual environments can be used to train image models by simulating real-world effects that are otherwise difficult to gather or replicate, such as haze and shadows. Given synthetic images, one cannot train a network directly on them, as there is a possible gap between the synthetic and real domains. We have devised several techniques for generating synthetic images and formulated domain adaptation methods with which our trained deep-learning networks perform competitively in distortion correction, dehazing, and shadow removal. Additional studies and directions are provided for the intrinsic image decomposition problem and the exploration of procedural content generation, where a virtual Philippine city was created as an initial prototype.
Keywords: image generation, image correction, image dehazing, shadow removal, intrinsic image decomposition, computer graphics, rendering, machine learning, neural networks, domain adaptation, procedural content generation
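As an example of the image formation equations mentioned above, hazy training pairs can be synthesized from clean images with the standard atmospheric scattering model I(x) = J(x) t(x) + A (1 - t(x)), with transmission t(x) = exp(-beta d(x)). A minimal sketch; the beta and airlight values are free parameters, not ones used in the dissertation:

    import numpy as np

    def add_haze(clean, depth, beta=1.0, airlight=0.9):
        """Synthesize a hazy image from a clean RGB image (H, W, 3) in
        [0, 1] and a depth map (H, W) via atmospheric scattering."""
        t = np.exp(-beta * depth)[..., None]    # transmission map (H, W, 1)
        return clean * t + airlight * (1.0 - t)

    # each (beta, airlight) setting yields a new synthetic training pair
    hazy = add_haze(np.random.rand(480, 640, 3), np.random.rand(480, 640) * 5)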