MP-MVS: Multi-Scale Windows PatchMatch and Planar Prior Multi-View Stereo
Significant strides have been made in enhancing the accuracy of Multi-View
Stereo (MVS)-based 3D reconstruction. However, untextured areas with unstable
photometric consistency often remain incompletely reconstructed. In this paper,
we propose a resilient and effective multi-view stereo approach (MP-MVS). We
design a multi-scale windows PatchMatch (mPM) to obtain reliable depth of
untextured areas. In contrast with other multi-scale approaches, mPM is
faster and can be easily extended to other PatchMatch-based MVS methods.
Subsequently, we improve the existing checkerboard sampling schemes by limiting
our sampling to distant regions, which can effectively improve the efficiency
of spatial propagation while mitigating outlier generation. Finally, we
build on and improve the planar prior assisted PatchMatch of ACMP. Instead of
relying on photometric consistency, we utilize geometric consistency
information between multiple views to select reliable triangulated vertices.
This strategy yields a more accurate planar prior model to rectify photometric
consistency measurements. Our approach has been tested against several
state-of-the-art approaches on the ETH3D High-res multi-view benchmark. The
results demonstrate that our approach achieves state-of-the-art performance.
The associated code will be made available at https://github.com/RongxuanTan/MP-MVS
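As a rough Python sketch of the multi-scale windows idea, the snippet below
averages an NCC-based matching cost over several window radii, so that large
windows stabilize the score in untextured regions while small windows preserve
detail. The radii, weights, and the shortcut of sampling the source patch at
the same location (a full PatchMatch would warp it through the plane-induced
homography) are illustrative assumptions, not the authors' implementation.

import numpy as np

def ncc(ref, src, eps=1e-6):
    # Normalized cross-correlation between two equally sized patches.
    ref = ref - ref.mean()
    src = src - src.mean()
    return float((ref * src).sum() / (np.linalg.norm(ref) * np.linalg.norm(src) + eps))

def multi_scale_cost(ref_img, src_img, x, y, radii=(2, 4, 8), weights=(0.5, 0.3, 0.2)):
    # Weighted sum of per-scale costs; assumes (x, y) is far enough from the
    # image border for the largest window. Cost 0 means a perfect match.
    cost = 0.0
    for r, w in zip(radii, weights):
        ref_patch = ref_img[y - r:y + r + 1, x - r:x + r + 1]
        src_patch = src_img[y - r:y + r + 1, x - r:x + r + 1]
        cost += w * (1.0 - ncc(ref_patch, src_patch))
    return cost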
H-Net: unsupervised attention-based stereo depth estimation leveraging epipolar geometry
Depth estimation from a stereo image pair has become one of the most explored applications in computer vision, with most previous methods relying on fully supervised learning settings. However, due to the difficulty of acquiring accurate and scalable ground-truth data, training fully supervised methods is challenging. As an alternative, self-supervised methods are becoming more popular for mitigating this challenge. In this paper, we introduce H-Net, a deep-learning framework for unsupervised stereo depth estimation that leverages epipolar geometry to refine stereo matching. For the first time, a Siamese autoencoder architecture is used for depth estimation, which allows mutual information between rectified stereo images to be extracted. To enforce the epipolar constraint, a mutual epipolar attention mechanism has been designed that gives more emphasis to correspondences of features lying on the same epipolar line while learning mutual information between the input stereo pair. Stereo correspondences are further enhanced by incorporating semantic information into the proposed attention mechanism. More specifically, the optimal transport algorithm is used to suppress attention and eliminate outliers in areas not visible to both cameras. Extensive experiments on KITTI2015 and Cityscapes show that the proposed modules improve the performance of unsupervised stereo depth estimation methods while closing the gap with fully supervised approaches.
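Since rectified stereo makes epipolar lines coincide with image rows, the
attention mechanism described above can be sketched as row-wise attention
between the two views' feature maps. The minimal Python sketch below reflects
only that geometric constraint; the shapes, scaling, and function names are
assumptions, and the semantic and optimal-transport components are omitted.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_epipolar_attention(feat_l, feat_r):
    # feat_l, feat_r: (H, W, C) features of a rectified pair. Attention is
    # computed per row, so a pixel only attends to positions on its own
    # epipolar line in the other view.
    H, W, C = feat_l.shape
    out = np.empty_like(feat_l)
    for y in range(H):
        q, k = feat_l[y], feat_r[y]            # (W, C) row features
        attn = softmax(q @ k.T / np.sqrt(C))   # (W, W) same-row attention
        out[y] = attn @ k                      # aggregate right-view features
    return out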
Refined Equivalent Pinhole Model for Large-scale 3D Reconstruction from Spaceborne CCD Imagery
In this study, we present a large-scale earth surface reconstruction pipeline
for linear-array charge-coupled device (CCD) satellite imagery. While
mainstream satellite image-based reconstruction approaches perform
exceptionally well, the rational functional model (RFM) is subject to several
limitations. For example, the RFM has no rigorous physical interpretation and
differs significantly from the pinhole imaging model; hence, it cannot be
directly applied to learning-based 3D reconstruction networks and to more novel
reconstruction pipelines in computer vision. Therefore, we introduce a method
that makes the RFM equivalent to the pinhole camera model (PCM),
meaning that the internal and external parameters of the pinhole camera are
used instead of the rational polynomial coefficient parameters. We then derive
an error formula for this equivalent pinhole model for the first time,
demonstrating the influence of the image size on the accuracy of the
reconstruction. In addition, we propose a polynomial image refinement model
that minimizes equivalent errors via the least squares method. The experiments
were conducted using four image datasets: WHU-TLC, DFC2019, ISPRS-ZY3, and GF7.
The results demonstrated that the reconstruction accuracy was proportional to
the image size. Our polynomial image refinement model significantly enhanced
the accuracy and completeness of the reconstruction, and achieved more
significant improvements for larger-scale images. Comment: 24 pages.
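As an illustration of what such a polynomial refinement could look like, the
Python sketch below fits a low-order 2D polynomial correction between
pinhole-projected and RFM-projected pixel coordinates by least squares; the
degree and monomial layout are assumptions for illustration, not the paper's
exact model.

import numpy as np

def poly_terms(uv, degree=2):
    # Monomial basis [1, u, v, u^2, u*v, v^2, ...] for each pixel (u, v).
    u, v = uv[:, 0], uv[:, 1]
    cols = [np.ones_like(u)]
    for d in range(1, degree + 1):
        for i in range(d + 1):
            cols.append(u ** (d - i) * v ** i)
    return np.stack(cols, axis=1)

def fit_refinement(uv_pinhole, uv_rfm, degree=2):
    # Least-squares polynomial correcting pinhole projections toward the RFM;
    # apply afterwards as uv_pinhole + poly_terms(uv_pinhole, degree) @ coeffs.
    A = poly_terms(uv_pinhole, degree)
    coeffs, *_ = np.linalg.lstsq(A, uv_rfm - uv_pinhole, rcond=None)
    return coeffs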
TOWARD 3D RECONSTRUCTION OF STATIC AND DYNAMIC OBJECTS
The goal of image-based 3D reconstruction is to construct a spatial understanding of the world from a collection of images. For applications that seek to model generic real-world scenes, it is important that the reconstruction methods used are able to characterize both static scene elements (e.g. trees and buildings) and dynamic objects (e.g. cars and pedestrians). However, due to many inherent ambiguities in the reconstruction problem, recovering this 3D information with accuracy, robustness, and efficiency is a considerable challenge. To advance the research frontier for image-based 3D modeling, this dissertation focuses on three challenging problems in static scene and dynamic object reconstruction.

We first target the problem of static scene depthmap estimation from crowd-sourced datasets (i.e. photos collected from the Internet). While achieving high-quality depthmaps using images taken under a controlled environment is already a difficult task, heterogeneous crowd-sourced data presents a unique set of challenges for multi-view depth estimation, including varying illumination and occasional occlusions. We propose a depthmap estimation method that demonstrates high accuracy, robustness, and scalability on a large number of photos collected from the Internet.

Compared to static scene reconstruction, the problem of dynamic object reconstruction from monocular images is fundamentally ambiguous without additional assumptions, because a single observation of an object is insufficient for valid 3D triangulation, which typically requires concurrent observations of the object from multiple viewpoints. Assuming that dynamic objects of the same class (e.g. all the pedestrians walking on a sidewalk) move along a common path in the real world, we develop a method that estimates the 3D positions of the dynamic objects from unstructured monocular images. Experiments on both synthetic and real datasets illustrate the solvability of the problem and the effectiveness of our approach.

Finally, we address the problem of dynamic object reconstruction from a set of unsynchronized videos capturing the same dynamic event. This problem is of great interest because, due to the increased availability of portable capture devices, captures using multiple unsynchronized videos are common in the real world. To resolve the challenges that arise from non-concurrent captures and unknown temporal overlap among video streams, we propose a self-expressive dictionary learning framework, where the dictionary entries are defined as the collection of temporally varying structures. Experiments demonstrate the effectiveness of this approach on the previously unsolved problem.
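As a toy Python illustration of the self-expressiveness idea behind such a
dictionary, the sketch below codes each column of a data matrix as a
combination of the remaining columns, with a zero diagonal so nothing
represents itself; the ridge penalty is a stand-in assumption for the
dissertation's actual regularizer.

import numpy as np

def self_expressive_codes(X, lam=1e-2):
    # X: (d, n) matrix whose columns are temporally varying structures.
    # Returns C with diag(C) = 0 minimizing ||x_i - X c_i||^2 + lam ||c_i||^2
    # independently for each column i.
    d, n = X.shape
    C = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]  # exclude the column itself
        D = X[:, idx]
        c = np.linalg.solve(D.T @ D + lam * np.eye(n - 1), D.T @ X[:, i])
        C[idx, i] = c
    return C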
On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: a Survey
Stereo matching is one of the longest-standing problems in computer vision
with close to 40 years of studies and research. Throughout the years, the
paradigm has shifted from local, pixel-level decisions to various forms of
discrete and continuous optimization, and then to data-driven, learning-based
methods. Recently, the rise of machine learning and the rapid proliferation of
deep learning have enhanced stereo matching with exciting new trends and
applications unthinkable until a few years ago. Interestingly, the relationship between
these two worlds is two-way. While machine, and especially deep, learning
advanced the state-of-the-art in stereo matching, stereo itself enabled new
ground-breaking methodologies such as self-supervised monocular depth
estimation based on deep networks. In this paper, we review recent research in
the field of learning-based depth estimation from single and binocular images,
highlighting the synergies, the successes achieved so far and the open
challenges the community is going to face in the immediate future. Comment:
Accepted to TPAMI. Paper version of our CVPR 2019 tutorial:
"Learning-based depth estimation from stereo and monocular images: successes,
limitations and future challenges"
(https://sites.google.com/view/cvpr-2019-depth-from-image/home).
ClusterFusion: Real-time Relative Positioning and Dense Reconstruction for UAV Cluster
As robotics technology advances, dense point cloud maps are increasingly in
demand. However, dense reconstruction using a single unmanned aerial vehicle
(UAV) suffers from limitations in flight speed and battery power, resulting in
slow reconstruction and low coverage. Cluster UAV systems offer greater
flexibility and wider coverage for map building. However, existing cluster-UAV
methods face challenges with accurate relative positioning, scale drift, and
high-speed dense point cloud map generation. To address these issues, we
propose a cluster framework for large-scale dense reconstruction and real-time
collaborative localization. The front end of the framework is an improved
visual odometry that can effectively handle large-scale scenes. Collaborative
localization between UAVs is enabled through a two-stage joint optimization
algorithm and a relative pose optimization algorithm, effectively achieving
accurate relative positioning of UAVs and mitigating scale drift. Estimated
poses are used to achieve real-time dense reconstruction and fusion of point
cloud maps. To evaluate the performance of our proposed method, we conduct
qualitative and quantitative experiments on real-world data. The results
demonstrate that our framework can effectively suppress scale drift and
generate large-scale dense point cloud maps in real-time, with the
reconstruction speed increasing as more UAVs are added to the system.
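A minimal Python sketch of the fusion step might look like the following: each
UAV's local cloud is transformed into a common frame with its estimated pose,
and the merged cloud is thinned with a voxel filter. The world-from-local pose
convention and the voxel size are illustrative assumptions, not the
framework's actual pipeline.

import numpy as np

def fuse_clouds(clouds, poses, voxel=0.2):
    # clouds: list of (N_i, 3) point arrays in each UAV's local frame.
    # poses: list of 4x4 world-from-local transforms from the estimator.
    world_pts = []
    for pts, T in zip(clouds, poses):
        homog = np.hstack([pts, np.ones((len(pts), 1))])
        world_pts.append((homog @ T.T)[:, :3])  # transform into world frame
    merged = np.vstack(world_pts)
    # Keep one point per occupied voxel to bound map size.
    keys = np.floor(merged / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(first)]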
Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells
Learning-based multi-view stereo (MVS) methods aim to predict accurate depth
maps to achieve an accurate and complete 3D representation. Despite their
excellent performance, existing methods ignore the fact that a suitable depth
geometry is also critical in MVS. In this paper, we demonstrate that different
depth geometries exhibit significant performance gaps, even under the same
depth prediction error. Therefore, we introduce an ideal depth geometry composed of
Saddle-Shaped Cells, whose predicted depth map oscillates upward and downward
around the ground-truth surface, rather than maintaining a continuous and
smooth depth plane. To achieve this, we develop a coarse-to-fine framework called
Dual-MVSNet (DMVSNet), which can produce an oscillating depth plane.
Technically, we predict two depth values for each pixel (Dual-Depth), and
propose a novel loss function and a checkerboard-shaped selecting strategy to
constrain the predicted depth geometry. Compared to existing methods, DMVSNet
achieves a high rank on the DTU benchmark and obtains the top performance on
challenging scenes of Tanks and Temples, demonstrating its strong performance
and generalization ability. Our method also points to a new research direction
for considering depth geometry in MVS. Comment: Accepted by ICCV 2023.
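As a schematic Python illustration of the Dual-Depth idea, the sketch below
merges two per-pixel depth hypotheses with a fixed checkerboard parity mask so
the selected depth oscillates between them; the paper's learned, loss-driven
selection strategy is more elaborate, so this is only a sketch of the pattern,
not the method.

import numpy as np

def checkerboard_select(depth_a, depth_b):
    # depth_a, depth_b: (H, W) depth hypotheses for the same view.
    # Alternating the choice per pixel yields a depth map that oscillates
    # around the surface instead of forming one smooth plane.
    H, W = depth_a.shape
    yy, xx = np.mgrid[0:H, 0:W]
    mask = (xx + yy) % 2 == 0
    return np.where(mask, depth_a, depth_b)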
Automatic 2D to Stereoscopic Video Conversion for 3DTV
In this thesis we address the problem of automatically converting a video filmed with a single camera into stereoscopic content tailored for viewing on 3D TVs. We present two techniques: (a) a non-parametric approach which does not require extensive training and produces good results for simple rigid scenes, and (b) a deep learning approach able to handle dynamic changes in the scene. The proposed solutions both include two stages: depth generation and rendering. For the first stage, the non-parametric approach utilizes an energy-based optimization, while the deep learning approach uses a multi-scale convolutional neural network to address the complex problem of depth estimation from a single image. Depth maps are generated based on the input RGB images. We reformulate and simplify the process of generating the virtual camera’s depth map and present how this can be used to render an anaglyph image. Anaglyph stereo was used for demonstration only because of the easy and wide availability of red/cyan glasses; however, this does not limit the applicability of the proposed technique to other stereo forms. Finally, we have extensively tested the proposed approaches and present the results.
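As a rough Python sketch of the rendering stage, the code below forward-warps
the left image by a depth-derived disparity to synthesize a virtual right view
and composes a red/cyan anaglyph from the pair. The linear disparity scaling,
the convention that depth 0 is nearest, and the absence of hole filling are
simplifying assumptions, not the thesis pipeline.

import numpy as np

def render_anaglyph(left_rgb, depth, max_disp=24):
    # left_rgb: (H, W, 3) uint8 image; depth: (H, W) normalized to [0, 1],
    # with 0 meaning nearest. Nearer pixels receive larger disparities.
    H, W, _ = left_rgb.shape
    disp = (max_disp * (1.0 - depth)).astype(int)
    right = np.zeros_like(left_rgb)
    xs = np.arange(W)
    for y in range(H):
        tx = np.clip(xs - disp[y], 0, W - 1)  # forward-warp columns leftwards
        right[y, tx] = left_rgb[y, xs]        # holes left unfilled for brevity
    anaglyph = right.copy()
    anaglyph[..., 0] = left_rgb[..., 0]       # red channel from the left view
    return anaglyph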