17 research outputs found
Metrically Scaled Monocular Depth Estimation through Sparse Priors for Underwater Robots
In this work, we address the problem of real-time dense depth estimation from
monocular images for mobile underwater vehicles. We formulate a deep learning
model that fuses sparse depth measurements from triangulated features to
improve the depth predictions and solve the problem of scale ambiguity. To
allow prior inputs of arbitrary sparsity, we apply a dense parameterization
method. Our model extends recent state-of-the-art approaches to monocular image
based depth estimation, using an efficient encoder-decoder backbone and modern
lightweight transformer optimization stage to encode global context. The
network is trained in a supervised fashion on the forward-looking underwater
dataset, FLSea. Evaluation results on this dataset demonstrate significant
improvement in depth prediction accuracy by the fusion of the sparse feature
priors. In addition, without any retraining, our method achieves similar depth
prediction accuracy on a downward looking dataset we collected with a diver
operated camera rig, conducting a survey of a coral reef. The method achieves
real-time performance, running at 160 FPS on a laptop GPU and 7 FPS on a single
CPU core and is suitable for direct deployment on embedded systems. The
implementation of this work is made publicly available at
https://github.com/ebnerluca/uw_depth.Comment: Submitted to ICRA 202
Recommended from our members
High-quality dense stereo vision for whole body imaging and obesity assessment
textThe prevalence of obesity has necessitated developing safe and convenient tools for timely assessing and monitoring this condition for a broad range of population. Three-dimensional (3D) body imaging has become a new mean for obesity assessment. Moreover, it generates body shape information that is meaningful for fitness, ergonomics, and personalized clothing. In the previous work of our lab, we developed a prototype active stereo vision system that demonstrated a potential to fulfill this goal. But the prototype required four computer projectors to cast artificial textures on the body which facilitate the stereo-matching on texture-deficient images (e.g., skin). This decreases the mobility of the system when used to collect a large population data. In addition, the resolution of the generated 3D~images is limited by both cameras and projectors available during the project. The study reported in this dissertation highlights our continued effort in improving the capability of 3Dbody imaging through simplified hardware for passive stereo and advanced computation techniques.
The system utilizes high-resolution single-lens reflex (SLR) cameras, which became widely available lately, and is configured in a two-stance design to image the front and back surfaces of a person. A total of eight cameras are used to form four pairs of stereo units. Each unit covers a quarter of the body surface. The stereo units are individually calibrated with a specific pattern to determine cameras' intrinsic and extrinsic parameters for stereo matching. The global orientation and position of each stereo unit within a common world coordinate system is calculated through a 3Dregistration step. The stereo calibration and 3Dregistration procedures do not need to be repeated for a deployed system if the cameras' relative positions have not changed. This property contributes to the portability of the system, and tremendously alleviates the maintenance task. The image acquisition time is around two seconds for a whole-body capture. The system works in an indoor environment with a moderate ambient light.
Advanced stereo computation algorithms are developed by taking advantage of high-resolution images and by tackling the ambiguity problem in stereo matching. A multi-scale, coarse-to-fine matching framework is proposed to match large-scale textures at a low resolution and refine the matched results over higher resolutions. This matching strategy reduces the complexity of the computation and avoids ambiguous matching at the native resolution. The pixel-to-pixel stereo matching algorithm follows a classic, four-step strategy which consists of matching cost computation, cost aggregation, disparity computation and disparity refinement.
The system performance has been evaluated on mannequins and human subjects in comparison with other measurement methods. It was found that the geometrical measurements from reconstructed 3Dbody models, including body circumferences and whole volume, are highly repeatable and consistent with manual and other instrumental measurements (CV 0.99). The agreement of percent body fat (%BF) estimation on human subjects between stereo and dual-energy X-ray absorptiometry (DEXA) was found to be improved over the previous active stereo system, and the limits of agreement with 95% confidence were reduced by half. Our achieved %BF estimation agreement is among the lowest ones of other comparative studies with commercialized air displacement plethysmography (ADP) and DEXA. In practice, %BF estimation through a two-component model is sensitive to body volume measurement, and the estimation of lung volume could be a source of variation. Protocols for this type of measurement should still be created with an awareness of this factor.Biomedical Engineerin
Mapping and Deep Analysis of Image Dehazing: Coherent Taxonomy, Datasets, Open Challenges, Motivations, and Recommendations
Our study aims to review and analyze the most relevant studies in the image dehazing field. Many aspects have been deemed necessary to provide a broad understanding of various studies that have been examined through surveying the existing literature. These aspects are as follows: datasets that have been used in the literature, challenges that other researchers have faced, motivations, and recommendations for diminishing the obstacles in the reported literature. A systematic protocol is employed to search all relevant articles on image dehazing, with variations in keywords, in addition to searching for evaluation and benchmark studies. The search process is established on three online databases, namely, IEEE Xplore, Web of Science (WOS), and ScienceDirect (SD), from 2008 to 2021. These indices are selected because they are sufficient in terms of coverage. Along with definition of the inclusion and exclusion criteria, we include 152 articles to the final set. A total of 55 out of 152 articles focused on various studies that conducted image dehazing, and 13 out 152 studies covered most of the review papers based on scenarios and general overviews. Finally, most of the included articles centered on the development of image dehazing algorithms based on real-time scenario (84/152) articles. Image dehazing removes unwanted visual effects and is often considered an image enhancement technique, which requires a fully automated algorithm to work under real-time outdoor applications, a reliable evaluation method, and datasets based on different weather conditions. Many relevant studies have been conducted to meet these critical requirements. We conducted objective image quality assessment experimental comparison of various image dehazing algorithms. In conclusions unlike other review papers, our study distinctly reflects different observations on image dehazing areas. We believe that the result of this study can serve as a useful guideline for practitioners who are looking for a comprehensive view on image dehazing
Fusion of computed point clouds and integral-imaging concepts for full-parallax 3D display
During the last century, various technologies of 3D image capturing and visualization have spotlighted, due to both their pioneering nature and the aspiration to extend the applications of conventional 2D imaging technology to 3D scenes. Besides, thanks to advances in opto-electronic imaging technologies, the possibilities of capturing and transmitting 2D images in real-time have progressed significantly, and boosted the growth of 3D image capturing, processing, transmission and as well as display techniques. Among the latter, integral-imaging technology has been considered as one of the promising ones to restore real 3D scenes through the use of a multi-view visualization system that provides to observers with a sense of immersive depth. Many research groups and companies have researched this novel technique with different approaches, and occasions for various complements.
In this work, we followed this trend, but processed through our novel strategies and algorithms. Thus, we may say that our approach is innovative, when compared to conventional proposals. The main objective of our research is to develop techniques that allow recording and simulating the natural scene in 3D by using several cameras which have different types and characteristics. Then, we compose a dense 3D scene from the computed 3D data by using various methods and techniques. Finally, we provide a volumetric scene which is restored with great similarity to the original shape, through a comprehensive 3D monitor and/or display system. Our Proposed integral-imaging monitor shows an immersive experience to multiple observers.
In this thesis we address the challenges of integral image production techniques based on the computerized 3D information, and we focus in particular on the implementation of full-parallax 3D display system. We have also made progress in overcoming the limitations of the conventional integral-imaging technique. In addition, we have developed different refinement methodologies and restoration strategies for the composed depth information. Finally, we have applied an adequate solution that reduces the computation times significantly, associated with the repetitive calculation phase in the generation of an integral image. All these results are presented by the corresponding images and proposed display experiments
Exploiting Spatio-Temporal Coherence for Video Object Detection in Robotics
This paper proposes a method to enhance video object detection for indoor environments in robotics. Concretely, it exploits knowledge about the camera motion between frames to propagate previously detected objects to successive frames. The proposal is rooted in the concepts of planar homography to propose regions of interest where to find objects, and recursive Bayesian filtering to integrate observations over time. The proposal is evaluated on six virtual, indoor environments, accounting for the detection of nine object classes over a total of ∼ 7k frames. Results show that our proposal improves the recall and the F1-score by a factor of 1.41 and 1.27, respectively, as well as it achieves a significant reduction of the object categorization entropy (58.8%) when compared to a two-stage video object detection method used as baseline, at the cost of small time overheads (120 ms) and precision loss (0.92).</p
Probabilistic modeling for single-photon lidar
Lidar is an increasingly prevalent technology for depth sensing, with applications including scientific measurement and autonomous navigation systems. While conventional systems require hundreds or thousands of photon detections per pixel to form accurate depth and reflectivity images, recent results for single-photon lidar (SPL) systems using single-photon avalanche diode (SPAD) detectors have shown accurate images formed from as little as one photon detection per pixel, even when half of those detections are due to uninformative ambient light. The keys to such photon-efficient image formation are two-fold: (i) a precise model of the probability distribution of photon detection times, and (ii) prior beliefs about the structure of natural scenes. Reducing the number of photons needed for accurate image formation enables faster, farther, and safer acquisition. Still, such photon-efficient systems are often limited to laboratory conditions more favorable than the real-world settings in which they would be deployed.
This thesis focuses on expanding the photon detection time models to address challenging imaging scenarios and the effects of non-ideal acquisition equipment. The processing derived from these enhanced models, sometimes modified jointly with the acquisition hardware, surpasses the performance of state-of-the-art photon counting systems.
We first address the problem of high levels of ambient light, which causes traditional depth and reflectivity estimators to fail. We achieve robustness to strong ambient light through a rigorously derived window-based censoring method that separates signal and background light detections. Spatial correlations both within and between depth and reflectivity images are encoded in superpixel constructions, which fill in holes caused by the censoring. Accurate depth and reflectivity images can then be formed with an average of 2 signal photons and 50 background photons per pixel, outperforming methods previously demonstrated at a signal-to-background ratio of 1.
We next approach the problem of coarse temporal resolution for photon detection time measurements, which limits the precision of depth estimates. To achieve sub-bin depth precision, we propose a subtractively-dithered lidar implementation, which uses changing synchronization delays to shift the time-quantization bin edges. We examine the generic noise model resulting from dithering Gaussian-distributed signals and introduce a generalized Gaussian approximation to the noise distribution and simple order statistics-based depth estimators that take advantage of this model. Additional analysis of the generalized Gaussian approximation yields rules of thumb for determining when and how to apply dither to quantized measurements. We implement a dithered SPL system and propose a modification for non-Gaussian pulse shapes that outperforms the Gaussian assumption in practical experiments. The resulting dithered-lidar architecture could be used to design SPAD array detectors that can form precise depth estimates despite relaxed temporal quantization constraints.
Finally, SPAD dead time effects have been considered a major limitation for fast data acquisition in SPL, since a commonly adopted approach for dead time mitigation is to operate in the low-flux regime where dead time effects can be ignored. We show that the empirical distribution of detection times converges to the stationary distribution of a Markov chain and demonstrate improvements in depth estimation and histogram correction using our Markov chain model. An example simulation shows that correctly compensating for dead times in a high-flux measurement can yield a 20-times speed up of data acquisition. The resulting accuracy at high photon flux could enable real-time applications such as autonomous navigation
Physics-based Reconstruction and Animation of Humans
Creating digital representations of humans is of utmost importance for applications ranging from entertainment (video games, movies) to human-computer interaction and even psychiatrical treatments. What makes building credible digital doubles difficult is the fact that the human vision system is very sensitive to perceiving the complex expressivity and potential anomalies in body structures and motion. This thesis will present several projects that tackle these problems from two different perspectives: lightweight acquisition and physics-based simulation. It starts by describing a complete pipeline that allows users to reconstruct fully rigged 3D facial avatars using video data coming from a handheld device (e.g., smartphone). The avatars use a novel two-scale representation composed of blendshapes and dynamic detail maps. They are constructed through an optimization that integrates feature tracking, optical flow, and shape from shading. Continuing along the lines of accessible acquisition systems, we discuss a framework for simultaneous tracking and modeling of articulated human bodies from RGB-D data. We show how semantic information can be extracted from the scanned body shapes. In the second half of the thesis, we will deviate from using standard linear reconstruction and animation models, and rather focus on exploiting physics-based techniques that are able to incorporate complex phenomena such as dynamics, collision response and incompressibility of the materials. The first approach we propose assumes that each 3D scan of an actor records his body in a physical steady state and uses a process called inverse physics to extract a volumetric physics-ready anatomical model of him. By using biologically-inspired growth models for the bones, muscles and fat, our method can obtain realistic anatomical reconstructions that can be later on animated using external tracking data such as the one resulting from tracking motion capture markers. This is then extended to a novel physics-based approach for facial reconstruction and animation. We propose a facial animation model which simulates biomechanical muscle contractions in a volumetric head model in order to create the facial expressions seen in the input scans. We then show how this approach allows for new avenues of dynamic artistic control, simulation of corrective facial surgery, and interaction with external forces and objects