Neural Radiance Fields: Past, Present, and Future
The modeling and interpretation of 3D environments and surroundings have long driven
research in 3D Computer Vision, Computer Graphics, and Machine Learning. The paper by
Mildenhall et al. on NeRFs (Neural Radiance Fields) triggered a boom across Computer
Graphics, Robotics, and Computer Vision, and the prospect of high-resolution, low-storage
3D models for Augmented and Virtual Reality has gained traction among researchers, with
more than 1,000 NeRF-related preprints published. This paper serves as a bridge for people starting to study
these fields by building on the basics of Mathematics, Geometry, Computer
Vision, and Computer Graphics to the difficulties encountered in Implicit
Representations at the intersection of all these disciplines. This survey
provides the history of rendering, Implicit Learning, and NeRFs, the
progression of research on NeRFs, and the potential applications and
implications of NeRFs in today's world. In doing so, this survey categorizes
all the NeRF-related research in terms of the datasets used, objective
functions, applications solved, and evaluation criteria for these applications.Comment: 413 pages, 9 figures, 277 citation
Orientation Analysis in 4D Light Fields
This work is about the analysis of 4D light fields. In the context of this work a light
field is a series of 2D digital images of a scene captured on a planar regular grid of
camera positions. It is essential that the scene is captured over several camera positions
having constant distances to each other. This results in a sampling of light rays emitted
by a single scene point as a function of the camera position. In contrast to traditional
images – measuring the light intensity in the spatial domain – this approach additionally captures directional information, leading to the four-dimensionality mentioned above.
For image processing, light fields are a relatively new research area. In computer graphics,
they were used to avoid the work-intensive modeling of 3D geometry by instead using
view interpolation to achieve interactive 3D experiences without explicit geometry. The
intention of this work is the reverse: to use light fields to reconstruct the geometry of
a captured scene, because light fields provide much richer information content
than existing approaches to 3D reconstruction. Due to the regular and dense
sampling of the scene, material properties are imaged in addition to geometry. Surfaces
whose visual appearance changes with the line of sight cause problems for
known approaches to passive 3D reconstruction. Light fields instead sample this change
in appearance and thus make analysis possible.
This thesis makes several contributions. We propose a new approach to convert raw
data from a light field camera (plenoptic camera 2.0) into a 4D representation without
pre-computing pixel-wise depth. This special representation – also called the
Lumigraph – enables access to epipolar planes, which are sub-spaces of the 4D data
structure. We propose an approach that analyzes these epipolar plane images to achieve
robust depth estimation on Lambertian surfaces. Based on this, an extension is presented
that also handles reflective and transparent surfaces. As examples of the usefulness of this
inherently available depth information, we show improvements to well-known techniques
like super-resolution and object segmentation when extending them to light fields.
Additionally, a benchmark database was established over the course of the research for
this thesis. We test the proposed approaches using this database and hope that it
helps to drive future research in this field.
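The orientation analysis on epipolar plane images described above rests on one fact: a scene point traces a line through the EPI whose slope is its disparity, and disparity is inversely proportional to depth. The toy single-point example below is only an illustration of that idea, not the thesis' robust orientation estimator:

```python
import numpy as np

def epi_disparity(epi):
    """Estimate disparity (pixels per camera step) from one epipolar plane image.

    epi: (S, X) array; row s is the same scanline seen from camera position s.
    We fit a line through the per-row intensity maxima, which suffices for a
    single dominant Lambertian point.
    """
    s = np.arange(epi.shape[0])
    x = epi.argmax(axis=1).astype(float)  # feature position in each view
    slope, _ = np.polyfit(s, x, 1)        # least-squares line fit
    return slope

# toy EPI: one point at x0 = 4 moving 2 px per camera step
S, X = 5, 20
epi = np.zeros((S, X))
for si in range(S):
    epi[si, 4 + 2 * si] = 1.0
```

Given the camera baseline and focal length, the recovered slope converts directly into metric depth.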
Multi-view Stereo Matching and Image Restoration Using a Unified System
Doctoral dissertation, Graduate School of Seoul National University, Department of Electrical and Computer Engineering, February 2017 (advisor: Kyoung Mu Lee).
Estimating camera pose and scene structure from seriously degraded images is a challenging problem. Most existing multi-view stereo algorithms assume high-quality input images and therefore produce unreliable results for blurred, noisy, or low-resolution images. Experimental results show that using off-the-shelf image reconstruction algorithms as an independent preprocessing step is generally ineffective or even counterproductive, because naive frame-wise image reconstruction methods fundamentally ignore the consistency between images, even though they seem to produce visually plausible results.
In this thesis, observing that image reconstruction and multi-view stereo problems are interrelated, we present a unified framework to solve them jointly. The validity of this approach is empirically verified on four different problems: dense depth map reconstruction, camera pose estimation, super-resolution, and deblurring from images obtained by a single moving camera. By reflecting the physical imaging process, we cast our objective as a cost minimization problem and solve it using alternating optimization techniques. Experiments show that the proposed method can restore high-quality depth maps from seriously degraded images for both synthetic and real video, where simple multi-view stereo methods fail. Our algorithm also produces superior super-resolution and deblurring results compared to simple preprocessing with conventional super-resolution and deblurring techniques.
Moreover, we show that the proposed framework can be generalized to handle more common scenarios. First, it can solve image reconstruction and multi-view stereo problems for multi-view single-shot images captured by a light field camera. By using the information in calibrated multi-view images, it recovers the motions of individual objects in the input image as well as the unknown camera motion during the shutter time.
The contribution of this thesis is a new perspective on existing computer vision problems from an integrated viewpoint. We show that by solving interrelated problems jointly, we can obtain physically more plausible solutions and better performance, especially when the input images are challenging. The proposed optimization algorithm also makes our method more practical in terms of computational complexity.
1 Introduction
1.1 Outline of Dissertation
2 Background
3 Generalized Imaging Model
3.1 Camera Projection Model
3.2 Depth and Warping Operation
3.3 Representation of Camera Pose in SE(3)
3.4 Proposed Imaging Model
4 Rendering Synthetic Datasets
4.1 Making Blurred Image Sequences using Depth-based Image Rendering
4.2 Making Blurred Image Sequences using Blender
5 A Unified Framework for Single-shot Multi-view Images
5.1 Introduction
5.2 Related Works
5.3 Deblurring with 4D Light Fields
5.3.1 Motion Blur Formulation in Light Fields
5.3.2 Initialization
5.4 Joint Estimation
5.4.1 Energy Formulation
5.4.2 Update Latent Image
5.4.3 Update Camera Pose and Depth map
5.5 Experimental Results
5.5.1 Synthetic Data
5.5.2 Real Data
5.6 Conclusion
6 A Unified Framework for a Monocular Image Sequence
6.1 Introduction
6.2 Related Works
6.3 Modeling Imaging Process
6.4 Unified Energy Formulation
6.4.1 Matching term
6.4.2 Self-consistency term
6.4.3 Regularization term
6.5 Optimization
6.5.1 Update of the depth maps and camera poses
6.5.2 Update of the latent images
6.5.3 Initialization
6.5.4 Occlusion Handling
6.6 Experimental Results
6.6.1 Synthetic datasets
6.6.2 Real datasets
6.6.3 The effect of parameters
6.7 Conclusion
7 A Unified Framework for SLAM
7.1 Motivation
7.2 Baseline
7.3 Proposed Method
7.4 Experimental Results
7.4.1 Quantitative comparison
7.4.2 Qualitative results
7.4.3 Runtime
7.5 Conclusion
8 Conclusion
8.1 Summary and Contribution of the Dissertation
8.2 Future Works
Bibliography
Abstract (in Korean)
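The depth-and-warping operation listed in the outline above (Sections 3.2 and 3.3) is the mechanism that couples depth maps, camera poses, and latent images in one cost. A minimal sketch, with hypothetical intrinsics and pose values chosen purely for illustration:

```python
import numpy as np

def warp_pixel(u, v, depth, K, R, t):
    """Warp a pixel from a reference view into a second view.

    Back-project (u, v) using its depth, apply the relative rigid motion
    (R, t), i.e. an element of SE(3), then re-project with intrinsics K.
    """
    p = np.linalg.inv(K) @ np.array([u, v, 1.0]) * depth  # 3D point, ref camera
    q = K @ (R @ p + t)                                   # move and re-project
    return q[0] / q[2], q[1] / q[2]                       # pixel in target view

K = np.array([[500.0, 0.0, 320.0],   # hypothetical pinhole intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                        # no rotation between the views
t = np.array([0.1, 0.0, 0.0])        # 10 cm sideways translation
u2, v2 = warp_pixel(320.0, 240.0, 2.0, K, R, t)  # principal point at 2 m depth
```

In a joint framework, the residual between a pixel and its warped correspondence is what drives the alternating updates of depth, pose, and latent image.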
Self-Supervised Learning for Geometry
This thesis focuses on two fundamental problems in robotic vision: scene geometry understanding and camera tracking. Both tasks have long been studied in robotic vision, and numerous geometric solutions have been proposed in the past decades. In this thesis, we cast these geometric problems as machine learning problems, specifically deep learning problems. Unlike conventional supervised learning methods that use expensive annotations as the supervisory signal, we advocate the use of geometry itself as a supervisory signal to improve the perceptual capabilities of robots, namely Geometry Self-supervision. With geometry self-supervision, we allow robots to learn and infer the 3D structure of the scene and their ego-motion by watching videos, instead of relying on the expensive ground-truth annotation of traditional supervised learning. After showing the use of geometry for deep learning, we show the possibility of integrating self-supervised models with traditional geometry-based methods as a hybrid solution to the mapping and tracking problem. We focus on an end-to-end mapping problem from stereo data in the first part of this thesis, namely Deep Stereo Matching. Stereo matching is one of the oldest problems in computer vision. Classical approaches to stereo matching typically rely on handcrafted features and a multi-step pipeline. Recent deep learning methods utilize deep neural networks to achieve end-to-end trained approaches that significantly outperform classic methods. We propose a novel data acquisition pipeline using an untethered device (Microsoft HoloLens) with a Time-of-Flight (ToF) depth camera and stereo cameras to collect real-world data. A novel semi-supervised method is proposed to train networks with both ground-truth supervision and self-supervision.
The large-scale real-world stereo dataset with semi-dense annotation and dense self-supervision allows our deep stereo matching network to generalize better than prior art. Mapping and tracking using a single camera (monocular) is a harder problem than with a stereo camera due to various well-known challenges. In the second part of this thesis, we decouple the problem into single-view depth estimation (mapping) and two-view visual odometry (tracking) and propose a self-supervised framework, namely SelfTAM, which jointly learns the depth estimator and the odometry estimator. The self-supervised problem is usually formulated as an energy minimization problem consisting of a data-consistency energy across multiple views (e.g. photometric) and a prior regularization energy (e.g. a depth smoothness prior). We strengthen the supervision signal with a deep feature consistency energy term and a surface normal regularization term. Though our method trains models with stereo sequences, so that a real-world scaling factor is naturally incorporated, only monocular data is required at inference. In the last part of this thesis, we revisit the basics of visual odometry and explore best practices for integrating deep learning models with geometry-based visual odometry methods. A robust visual odometry system, DF-VO, is proposed. We use deep networks to establish 2D-2D/3D-2D correspondences and pick the best correspondences from the dense predictions. Feeding the high-quality correspondences into traditional VO methods, e.g. Epipolar Geometry and Perspective-n-Point, we solve the visual odometry problem within a more robust framework. With the proposed self-supervised training, we can even allow the models to perform online adaptation at run time, taking a step toward a lifelong-learning visual odometry system.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
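The photometric self-supervision described above can be illustrated with a minimal 1D rectified-stereo sketch. The focal length, baseline, and nearest-neighbour sampling here are simplifying assumptions for illustration, not the actual SelfTAM loss:

```python
import numpy as np

def photometric_loss(target, source, depth, f, baseline):
    """Self-supervision for depth: every target pixel x should match source
    pixel x - disparity, where disparity = f * baseline / depth. The mean L1
    difference is the training signal; no ground-truth depth is needed.
    """
    x = np.arange(target.size)
    disp = f * baseline / depth                               # per-pixel disparity
    src_x = np.clip(np.round(x - disp).astype(int), 0, source.size - 1)
    return np.abs(target - source[src_x]).mean()

# toy pair: the true disparity is 3 px everywhere (f = 1, baseline = 3, depth = 1)
source = np.arange(20.0)
target = source[np.clip(np.arange(20) - 3, 0, 19)]
```

With the correct depth the loss vanishes; a wrong depth samples the wrong source pixels and raises the loss, which is the gradient signal exploited during training.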
The Virtual Video Camera: A System for Viewpoint Synthesis in Arbitrary Dynamic Scenes
The Virtual Video Camera project strives to create free viewpoint video from casually captured multi-view data. Multiple video streams of a dynamic scene are captured with off-the-shelf camcorders, and the user can re-render the scene from novel perspectives. In this thesis the algorithmic core of the Virtual Video Camera is presented. This includes the algorithm
for image correspondence estimation as well as the image-based renderer. Furthermore, its application in the context of an actual video production is showcased, and the rendering and image processing pipeline is extended to incorporate depth information.
Image Restoration
This book presents a sample of recent contributions of researchers around the world in the field of image restoration. The book consists of 15 chapters organized in three main sections (Theory, Applications, Interdisciplinarity). Topics cover different aspects of the theory of image restoration, but this book is also an occasion to highlight new topics of research related to the emergence of original imaging devices. From these arise real challenging problems of image reconstruction/restoration that open the way to new fundamental scientific questions closely related to the world we interact with.
Multi-task near-field perception for autonomous driving using surround-view fisheye cameras
The formation of eyes led to the big bang of evolution. The dynamics changed from a primitive organism waiting for food to come into contact with it, to an organism that seeks food using visual sensors. The human eye is one of the most sophisticated developments of evolution, but it still has defects. Over millions of years, humans have evolved a biological perception algorithm capable of driving cars, operating machinery, piloting aircraft, and navigating ships. Automating these capabilities for computers is critical for various applications, including self-driving cars, augmented reality, and architectural surveying. Near-field visual perception in the context of self-driving cars covers the environment in a range of 0 - 10 meters with 360° coverage around the vehicle. It is a critical decision-making component in the development of safer automated driving. Recent advances in computer vision and deep learning, in conjunction with high-quality sensors such as cameras and LiDARs, have fueled mature visual perception solutions. Until now, far-field perception has been the primary focus. Another significant issue is the limited processing power available for developing real-time applications. Because of this bottleneck, there is frequently a trade-off between performance and run-time efficiency. We concentrate on the following issues to address them: 1) Developing near-field perception algorithms with high performance and low computational complexity for various visual perception tasks, such as geometric and semantic tasks, using convolutional neural networks. 2) Using Multi-Task Learning to overcome computational bottlenecks by sharing initial convolutional layers between tasks and developing optimization strategies that balance the tasks.
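The layer-sharing idea in point 2 can be sketched with a toy two-head network. The layer sizes, task heads, and fixed loss weights are illustrative assumptions, not the architecture developed in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_task_forward(x, W_enc, W_depth, W_seg):
    """One shared encoder feeds two task heads, so the expensive early
    feature computation is paid once per image instead of once per task."""
    feats = np.maximum(0.0, x @ W_enc)   # shared ReLU features (computed once)
    depth = feats @ W_depth              # geometric head (regression)
    seg_logits = feats @ W_seg           # semantic head (classification)
    return depth, seg_logits

def total_loss(depth_err, seg_err, w_depth=1.0, w_seg=1.0):
    """Fixed task weights; the thesis develops strategies for balancing them."""
    return w_depth * depth_err + w_seg * seg_err

x = rng.normal(size=(4, 64))             # batch of 4 flattened inputs
W_enc = rng.normal(size=(64, 32))        # shared initial layer
W_depth = rng.normal(size=(32, 1))       # depth head weights
W_seg = rng.normal(size=(32, 10))        # 10-class segmentation head weights
depth, seg = multi_task_forward(x, W_enc, W_depth, W_seg)
```

Because `W_enc` is applied once and reused by both heads, the run-time cost of adding a task is only its (small) head, which is the computational argument for multi-task near-field perception.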
Super-resolution of 3-dimensional scenes
Super-resolution is an image enhancement method that increases the resolution of images and video. Previously this technique could only be applied to 2D scenes. The super-resolution algorithm developed in this thesis creates high-resolution views of 3-dimensional scenes, using low-resolution images captured from varying, unknown positions.
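A minimal sketch of the classical shift-and-add view of multi-frame super-resolution, assuming the sub-pixel positions are known (the thesis addresses the harder case where the capture positions are unknown and must be estimated):

```python
import numpy as np

def shift_and_add(frames, shifts, factor):
    """Fuse several low-res 1D signals into one high-res signal.

    Each LR sample is placed at its true position on the HR grid (its frame's
    known sub-pixel shift, in HR units); overlapping samples are averaged.
    """
    acc = np.zeros(frames[0].size * factor)
    cnt = np.zeros_like(acc)
    for frame, sh in zip(frames, shifts):
        pos = np.arange(frame.size) * factor + sh  # HR grid positions
        acc[pos] += frame
        cnt[pos] += 1
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), 0.0)

hr = np.arange(8.0)                       # "ground truth" high-res signal
frames = [hr[0::2], hr[1::2]]             # two half-resolution observations
recon = shift_and_add(frames, [0, 1], 2)  # fuse them back onto the HR grid
```

Because the two frames sample complementary sub-pixel phases, the HR signal is recovered exactly in this toy case; with noise and unknown shifts, registration and regularized inversion take the place of plain averaging.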