5,917 research outputs found

    Multiple-view Video Coding Using Depth Map in Projective Space

    Full text link
    Abstract. In this paper a new video coding by using multiple uncal-ibrated cameras is proposed. We consider the redundancy between the cameras view points and efficiently compress based on a depth map. Since our target videos are taken with uncalibrated cameras, our depth map is computed not in the real world but in the Projective Space. This is a virtual space defined by projective reconstruction of two still images. This means that the position in the space is correspondence to the depth value. Therefore we do not require full-calibration of the cameras. Gen-erating the depth map requires finding the correspondence between the cameras. We use a “plane sweep ” algorithm for it. Our method needs only a depth map except for the original base image and the camera parameters, and it contributes to the effectiveness of compression.

    Deep Projective 3D Semantic Segmentation

    Full text link
    Semantic segmentation of 3D point clouds is a challenging problem with numerous real-world applications. While deep learning has revolutionized the field of image semantic segmentation, its impact on point cloud data has been limited so far. Recent attempts, based on 3D deep learning approaches (3D-CNNs), have achieved below-expected results. Such methods require voxelizations of the underlying point cloud data, leading to decreased spatial resolution and increased memory consumption. Additionally, 3D-CNNs greatly suffer from the limited availability of annotated datasets. In this paper, we propose an alternative framework that avoids the limitations of 3D-CNNs. Instead of directly solving the problem in 3D, we first project the point cloud onto a set of synthetic 2D-images. These images are then used as input to a 2D-CNN, designed for semantic segmentation. Finally, the obtained prediction scores are re-projected to the point cloud to obtain the segmentation results. We further investigate the impact of multiple modalities, such as color, depth and surface normals, in a multi-stream network architecture. Experiments are performed on the recent Semantic3D dataset. Our approach sets a new state-of-the-art by achieving a relative gain of 7.9 %, compared to the previous best approach.Comment: Submitted to CAIP 201

    Acquisition, compression and rendering of depth and texture for multi-view video

    Get PDF
    Three-dimensional (3D) video and imaging technologies is an emerging trend in the development of digital video systems, as we presently witness the appearance of 3D displays, coding systems, and 3D camera setups. Three-dimensional multi-view video is typically obtained from a set of synchronized cameras, which are capturing the same scene from different viewpoints. This technique especially enables applications such as freeviewpoint video or 3D-TV. Free-viewpoint video applications provide the feature to interactively select and render a virtual viewpoint of the scene. A 3D experience such as for example in 3D-TV is obtained if the data representation and display enable to distinguish the relief of the scene, i.e., the depth within the scene. With 3D-TV, the depth of the scene can be perceived using a multi-view display that renders simultaneously several views of the same scene. To render these multiple views on a remote display, an efficient transmission, and thus compression of the multi-view video is necessary. However, a major problem when dealing with multiview video is the intrinsically large amount of data to be compressed, decompressed and rendered. We aim at an efficient and flexible multi-view video system, and explore three different aspects. First, we develop an algorithm for acquiring a depth signal from a multi-view setup. Second, we present efficient 3D rendering algorithms for a multi-view signal. Third, we propose coding techniques for 3D multi-view signals, based on the use of an explicit depth signal. This motivates that the thesis is divided in three parts. The first part (Chapter 3) addresses the problem of 3D multi-view video acquisition. Multi-view video acquisition refers to the task of estimating and recording a 3D geometric description of the scene. A 3D description of the scene can be represented by a so-called depth image, which can be estimated by triangulation of the corresponding pixels in the multiple views. Initially, we focus on the problem of depth estimation using two views, and present the basic geometric model that enables the triangulation of corresponding pixels across the views. Next, we review two calculation/optimization strategies for determining corresponding pixels: a local and a one-dimensional optimization strategy. Second, to generalize from the two-view case, we introduce a simple geometric model for estimating the depth using multiple views simultaneously. Based on this geometric model, we propose a new multi-view depth-estimation technique, employing a one-dimensional optimization strategy that (1) reduces the noise level in the estimated depth images and (2) enforces consistent depth images across the views. The second part (Chapter 4) details the problem of multi-view image rendering. Multi-view image rendering refers to the process of generating synthetic images using multiple views. Two different rendering techniques are initially explored: a 3D image warping and a mesh-based rendering technique. Each of these methods has its limitations and suffers from either high computational complexity or low image rendering quality. As a consequence, we present two image-based rendering algorithms that improves the balance on the aforementioned issues. First, we derive an alternative formulation of the relief texture algorithm which was extented to the geometry of multiple views. The proposed technique features two advantages: it avoids rendering artifacts ("holes") in the synthetic image and it is suitable for execution on a standard Graphics Processor Unit (GPU). Second, we propose an inverse mapping rendering technique that allows a simple and accurate re-sampling of synthetic pixels. Experimental comparisons with 3D image warping show an improvement of rendering quality of 3.8 dB for the relief texture mapping and 3.0 dB for the inverse mapping rendering technique. The third part concentrates on the compression problem of multi-view texture and depth video (Chapters 5–7). In Chapter 5, we extend the standard H.264/MPEG-4 AVC video compression algorithm for handling the compression of multi-view video. As opposed to the Multi-view Video Coding (MVC) standard that encodes only the multi-view texture data, the proposed encoder peforms the compression of both the texture and the depth multi-view sequences. The proposed extension is based on exploiting the correlation between the multiple camera views. To this end, two different approaches for predictive coding of views have been investigated: a block-based disparity-compensated prediction technique and a View Synthesis Prediction (VSP) scheme. Whereas VSP relies on an accurate depth image, the block-based disparity-compensated prediction scheme can be performed without any geometry information. Our encoder adaptively selects the most appropriate prediction scheme using a rate-distortion criterion for an optimal prediction-mode selection. We present experimental results for several texture and depth multi-view sequences, yielding a quality improvement of up to 0.6 dB for the texture and 3.2 dB for the depth, when compared to solely performing H.264/MPEG-4AVC disparitycompensated prediction. Additionally, we discuss the trade-off between the random-access to a user-selected view and the coding efficiency. Experimental results illustrating and quantifying this trade-off are provided. In Chapter 6, we focus on the compression of a depth signal. We present a novel depth image coding algorithm which concentrates on the special characteristics of depth images: smooth regions delineated by sharp edges. The algorithm models these smooth regions using parameterized piecewiselinear functions and sharp edges by a straight line, so that it is more efficient than a conventional transform-based encoder. To optimize the quality of the coding system for a given bit rate, a special global rate-distortion optimization balances the rate against the accuracy of the signal representation. For typical bit rates, i.e., between 0.01 and 0.25 bit/pixel, experiments have revealed that the coder outperforms a standard JPEG-2000 encoder by 0.6-3.0 dB. Preliminary results were published in the Proceedings of 26th Symposium on Information Theory in the Benelux. In Chapter 7, we propose a novel joint depth-texture bit-allocation algorithm for the joint compression of texture and depth images. The described algorithm combines the depth and texture Rate-Distortion (R-D) curves, to obtain a single R-D surface that allows the optimization of the joint bit-allocation in relation to the obtained rendering quality. Experimental results show an estimated gain of 1 dB compared to a compression performed without joint bit-allocation optimization. Besides this, our joint R-D model can be readily integrated into an multi-view H.264/MPEG-4 AVC coder because it yields the optimal compression setting with a limited computation effort

    Surface Reconstruction and Evolution from Multiple Views

    Get PDF
    Applications like 3D Telepresence necessitate faithful 3D surface reconstruction of the object and 3D data compression in both spatial and temporal domains. This makes us feel immersed in virtual environments there by making 3D Telepresence a powerful tool in many applications. Hence 3D surface reconstruction and 3D compression are two challenging problems which are addressed in this thesis

    Disparity compensation using geometric transforms

    Get PDF
    This dissertation describes the research and development of some techniques to enhance the disparity compensation in 3D video compression algorithms. Disparity compensation is usually performed using a block matching technique between views, disregarding the various levels of disparity present for objects at different depths in the scene. An alternative coding scheme is proposed, taking advantage of the cameras setup information and the object’s depth in the scene, to compensate more complex spatial distortions, being able to improve disparity compensation even with convergent cameras. In order to perform a more accurate disparity compensation, the reference picture list is enriched with additional geometrically transformed images, for the most relevant object’s levels of depth in the scene, resulting from projections of one view to another. This scheme can be implemented in any state-of-the-art video codec, as H.264/AVC or HEVC, in order to improve the disparity matching accuracy between views. Experimental results, using MV-HEVC extension, show the efficiency of the proposed method for coding stereo video, presenting bitrate savings up to 2.87%, for convergent camera sequences, and 1.52% for parallel camera sequences. Also a method to choose the geometrically transformed inter view reference pictures was developed, in order to reduce unnecessary overhead for unused reference pictures. By selecting and adding to the reference picture list, only the most useful pictures, all results improved, presenting bitrate savings up to 3.06% for convergent camera sequences, and 2% for parallel camera sequences

    Acquiring 3D scene information from 2D images

    Get PDF
    In recent years, people are becoming increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily life. Commercial 3D movies are now commonly available for consumers. Virtual navigation of our living environment as used on a personal computer has become a reality due to well-known web-based geographic applications using advanced imaging technologies. To enable such 3D applications, many technological challenges such as 3D content creation, 3D displaying technology and 3D content transmission need to tackled and deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming for an automatic and low-cost production of the 3D content. In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing the sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimum human interventions, in order to reduce the production cost of 3D contents. Experimental results on real videos of hundreds and thousands frames have shown that the two systems are able to accurately and automatically reconstruct the 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have developed three key scientific contributions to enable the two proposed 3D reconstruction systems. The first contribution is that we have designed a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move with similar directions and magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advan- 1 tages. First, the algorithm is robust to illumination changes, as the employed smoothness constraint does not rely on any texture information. Second, the algorithm has a good capability to handle the drift of the feature points over time, as the drift can hardly lead to a violation of the smoothness constraint. This leads to the large number of feature points matched and tracked by the proposed algorithm, which significantly helps the subsequent 3D modeling process. Our feature point matching algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that the proposed algorithm is able to track at least 2.5 times as many feature points compared with the state-of-the-art algorithms, with a comparable or higher accuracy. This contributes significantly to the robustness of the 3D reconstruction process. The second contribution is that we have developed algorithms to detect critical configurations where the factorization-based 3D reconstruction degenerates. Based on the detection, we have proposed a sequence-dividing algorithm to divide a long sequence into subsequences, such that successful 3D reconstructions can be performed on individual subsequences with a high confidence. The partial reconstructions are merged later to obtain the 3D model of the complete scene. In the critical configuration detection algorithm, the four critical configurations are detected: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) presence of excessive noise and outliers in the measurements. The configurations in cases (1), (2) and (4) will affect the rank of the Scaled Measurement Matrix (SMM). The number of camera centers in case (3) will affect the number of independent rows of the SMM. By examining the rank and the row space of the SMM, the abovementioned critical configurations are detected. Based on the detection results, the proposed sequence-dividing algorithm divides a long sequence into subsequences, such that each subsequence is free of the four critical configurations in order to obtain successful 3D reconstructions on individual subsequences. Experimental results on both synthetic and real sequences have demonstrated that the above four critical configurations are robustly detected, and a long sequence of thousands frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for an automatical 3D reconstruction on long sequences. The third contribution is that we have proposed a coarse-to-fine multiple-view depth labeling algorithm to compute depth maps from multiple-view videos, where the accuracy of resulting depth maps is gradually refined in multiple optimization passes. In the proposed algorithm, multiple-view depth reconstruction is formulated as an image-based labeling problem using the framework of Maximum A Posterior (MAP) on Markov Random Fields (MRF). The MAP-MRF framework allows the combination of various objective and heuristic depth cues to define the local penalty and the interaction energies, which provides a straightforward and computationally tractable formulation. Furthermore, the global optimal MAP solution to depth labeli ing can be found by minimizing the local energies, using existing MRF optimization algorithms. The proposed algorithm contains the following three key contributions. (1) A graph construction algorithm to proposed to construct triangular meshes on over-segmentation maps, in order to exploit the color and the texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies. Furthermore, the local energies are adapted to the local image content, in order to consider the varying nature of the image content for an accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes. By doing so, both the computational efficiency and the robustness of the depth labeling process are improved. The experimental results on real multiple-view videos show that the depth maps of for selected reference view are accurately reconstructed. Depth discontinuities are very well preserved

    Survey of image-based representations and compression techniques

    Get PDF
    In this paper, we survey the techniques for image-based rendering (IBR) and for compressing image-based representations. Unlike traditional three-dimensional (3-D) computer graphics, in which 3-D geometry of the scene is known, IBR techniques render novel views directly from input images. IBR techniques can be classified into three categories according to how much geometric information is used: rendering without geometry, rendering with implicit geometry (i.e., correspondence), and rendering with explicit geometry (either with approximate or accurate geometry). We discuss the characteristics of these categories and their representative techniques. IBR techniques demonstrate a surprising diverse range in their extent of use of images and geometry in representing 3-D scenes. We explore the issues in trading off the use of images and geometry by revisiting plenoptic-sampling analysis and the notions of view dependency and geometric proxies. Finally, we highlight compression techniques specifically designed for image-based representations. Such compression techniques are important in making IBR techniques practical.published_or_final_versio
    corecore