Computer Vision-Based Hand Tracking and 3D Reconstruction as a Human-Computer Input Modality with Clinical Application
The recent pandemic has impeded patients with hand injuries from connecting in person with their therapists. To address this challenge and improve hand telerehabilitation, we propose two computer vision-based technologies, photogrammetry and augmented reality, as alternative and affordable solutions for visualization and remote monitoring of hand trauma without costly equipment. In this thesis, we extend the application of 3D rendering and virtual reality-based user interfaces to hand therapy. We compare the performance of four popular photogrammetry software packages in reconstructing a 3D model of a synthetic human hand from videos captured with a smartphone, comparing the visual quality, reconstruction time, and geometric accuracy of the output meshes. Reality Capture produces the best result, with the lowest output mesh error of 1 mm and a total reconstruction time of 15 minutes. We developed an augmented reality app using MediaPipe algorithms that extracts hand key points, finger joint coordinates, and angles in real time from hand images or live-streamed media. We conducted a study to investigate its input variability and validity as a reliable tool for remote assessment of finger range of motion. The intraclass correlation coefficient between DIGITS and in-person measurements is 0.767–0.81 for finger extension and 0.958–0.857 for finger flexion. Finally, we developed and surveyed the usability of a mobile application that collects patient data (medical history, self-reported pain levels, and hand 3D models) and transfers them to therapists. These technologies can improve hand telerehabilitation, aid clinicians in monitoring hand conditions remotely, and support decisions on appropriate therapy, medication, and hand orthoses.
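As a concrete illustration of the key-point-to-angle step, the sketch below derives a finger joint angle from MediaPipe Hands landmarks. The landmark indices, file name, and angle definition are our illustrative assumptions, not the DIGITS implementation.

```python
import numpy as np
import mediapipe as mp
import cv2

# Indices of the index finger's MCP, PIP, and DIP landmarks in MediaPipe Hands
# (see the official landmark diagram); chosen here purely for illustration.
MCP, PIP, DIP = 5, 6, 7

def joint_angle(a, b, c):
    """Angle at landmark b (degrees) formed by points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)
image = cv2.cvtColor(cv2.imread("hand.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical file
result = hands.process(image)

if result.multi_hand_landmarks:
    lm = result.multi_hand_landmarks[0].landmark
    pts = np.array([[p.x, p.y, p.z] for p in lm])  # normalized coordinates
    print("Index PIP flexion angle:", joint_angle(pts[MCP], pts[PIP], pts[DIP]))
```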
Towards Reliable and Accurate Global Structure-from-Motion
Reconstruction of objects or scenes from sparse point detections across multiple views is one of the most tackled problems in computer vision. Given the coordinates of 2D points tracked in multiple images, the problem consists of estimating the corresponding 3D points and camera calibrations (intrinsics and pose), and can be solved by minimizing reprojection errors using bundle adjustment. However, given bundle adjustment's nonlinear objective function and iterative nature, a good starting guess is required to converge to a global minimum. Global and Incremental Structure-from-Motion methods are ways to provide good initializations for bundle adjustment, each with different properties. While Global Structure-from-Motion has been shown to produce more accurate reconstructions than Incremental Structure-from-Motion, the latter scales better: it starts with a small subset of images and sequentially adds new views, allowing reconstruction of sequences with millions of images. Additionally, both Global and Incremental Structure-from-Motion rely on accurate models of the scene or object, and under noisy conditions or high model uncertainty they might produce poor initializations for bundle adjustment. Recently, pOSE, a class of matrix factorization methods, has been proposed as an alternative to conventional Global SfM methods. These methods use VarPro, a second-order optimization method, to minimize a linear combination of an approximation of the reprojection errors and a regularization term based on an affine camera model, and have been shown to converge to global minima at a high rate even when starting from random camera calibration estimates. This thesis aims at improving the reliability and accuracy of global SfM through three approaches. First, by studying conditions for global optimality of point set registration, a point cloud averaging method that can be used when (incomplete) 3D point clouds of the same scene in different coordinate systems are available. Second, by extending pOSE methods to different Structure-from-Motion problem instances, such as Non-Rigid SfM and radial-distortion-invariant SfM. Third and finally, by replacing the regularization term of pOSE methods with an exponential regularization on the projective depths of the 3D point estimates, resulting in a loss that achieves reconstructions with accuracy close to bundle adjustment.
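For reference, the bundle adjustment objective referred to throughout is the standard one (generic textbook notation, not this thesis's exact formulation): with cameras $P_i$, 3D points $X_j$, observations $x_{ij}$ over the set $\Omega$ of visible pairs, projection $\pi$, and an optional robust kernel $\rho$,

```latex
\min_{\{P_i\},\,\{X_j\}} \sum_{(i,j) \in \Omega} \rho\left( \left\lVert \pi(P_i X_j) - x_{ij} \right\rVert^2 \right)
```

Its nonlinearity (through $\pi$) is why a good initialization, e.g. from Global or Incremental SfM, is needed to avoid poor local minima.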
Visibility-Aware Pixelwise View Selection for Multi-View Stereo Matching
The performance of PatchMatch-based multi-view stereo algorithms depends
heavily on the source views selected for computing matching costs. Instead of
modeling the visibility of different views, most existing approaches handle
occlusions in an ad-hoc manner. To address this issue, we propose a novel
visibility-guided pixelwise view selection scheme in this paper. It
progressively refines the set of source views to be used for each pixel in the
reference view based on visibility information provided by already validated
solutions. In addition, the Artificial Multi-Bee Colony (AMBC) algorithm is
employed to search for optimal solutions for different pixels in parallel.
Inter-colony communication is performed both within the same image and among
different images. Fitness rewards are added to validated and propagated
solutions, effectively enforcing the smoothness of neighboring pixels and
allowing better handling of textureless areas. Experimental results on the DTU
dataset show our method achieves state-of-the-art performance among
non-learning-based methods and retrieves more details in occluded and
low-textured regions.
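The per-pixel view selection idea can be summarized schematically as follows; this is our own toy illustration of visibility-guided cost aggregation, not the paper's AMBC implementation.

```python
import numpy as np

def aggregate_cost(costs, visible):
    """Aggregate per-view matching costs for one reference pixel,
    keeping only source views currently marked visible for that pixel.

    costs:   (V,) photometric matching cost against each source view
    visible: (V,) boolean mask derived from already validated solutions
    """
    sel = costs[visible]
    if sel.size == 0:  # no visibility info yet: fall back to the best half
        sel = np.sort(costs)[: max(1, len(costs) // 2)]
    return sel.mean()

costs = np.array([0.12, 0.55, 0.20, 0.80])     # e.g. 1 - NCC per source view
visible = np.array([True, False, True, False])  # propagated from neighbors
print(aggregate_cost(costs, visible))           # averages views 0 and 2 only
```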
A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video
Generating accurate 3D reconstructions from endoscopic video is a promising
avenue for longitudinal radiation-free analysis of sinus anatomy and surgical
outcomes. Several methods for monocular reconstruction have been proposed,
yielding visually pleasing 3D anatomical structures by retrieving relative
camera poses with structure-from-motion-type algorithms and fusion of monocular
depth estimates. However, due to the complex properties of the underlying
algorithms and endoscopic scenes, the reconstruction pipeline may perform
poorly or fail unexpectedly. Further, acquiring medical data poses additional
challenges, presenting difficulties in quantitatively benchmarking these
models, understanding failure cases, and identifying critical components that
contribute to their precision. In this work, we perform a quantitative analysis
of a self-supervised approach for sinus reconstruction using endoscopic
sequences paired with optical tracking and high-resolution computed tomography
acquired from nine ex-vivo specimens. Our results show that the generated
reconstructions are in high agreement with the anatomy, yielding an average
point-to-mesh error of 0.91 mm between reconstructions and CT segmentations.
However, in a point-to-point matching scenario, relevant for endoscope tracking
and navigation, we found average target registration errors of 6.58 mm. We
identified that pose and depth estimation inaccuracies contribute equally to
this error and that locally consistent sequences with shorter trajectories
generate more accurate reconstructions. These results suggest that achieving
global consistency of relative camera poses and estimated depths with the
anatomy is essential. In doing so, we can ensure proper synergy between all
components of the pipeline for improved reconstructions that will facilitate
clinical application of this innovative technology.
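A minimal sketch of the two evaluation metrics reported above, using trimesh for point-to-mesh distances and plain numpy for target registration error; the file names are hypothetical, and this is our reconstruction of the metrics, not the authors' evaluation code.

```python
import numpy as np
import trimesh

def point_to_mesh_error(points, mesh):
    """Mean unsigned distance from reconstructed points to the CT mesh."""
    closest, dist, _ = trimesh.proximity.closest_point(mesh, points)
    return dist.mean()

def target_registration_error(pred, gt):
    """Mean distance between corresponding (matched) points, relevant for
    endoscope tracking; pred and gt are (N, 3) arrays in the same frame."""
    return np.linalg.norm(pred - gt, axis=1).mean()

mesh = trimesh.load("ct_segmentation.ply")    # hypothetical file names
recon = np.load("reconstruction_points.npy")
print("point-to-mesh error [mm]:", point_to_mesh_error(recon, mesh))
```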
Differentiable Display Photometric Stereo
Photometric stereo leverages variations in illumination conditions to
reconstruct per-pixel surface normals. The concept of display photometric
stereo, which employs a conventional monitor as an illumination source, has the
potential to overcome limitations often encountered in bulky and
difficult-to-use conventional setups. In this paper, we introduce
Differentiable Display Photometric Stereo (DDPS), a method designed to achieve
high-fidelity normal reconstruction using an off-the-shelf monitor and camera.
DDPS addresses a critical yet often neglected challenge in photometric stereo:
the optimization of display patterns for enhanced normal reconstruction. We
present a differentiable framework that couples basis-illumination image
formation with a photometric-stereo reconstruction method. This facilitates the
learning of display patterns that lead to high-quality normal reconstruction
through automatic differentiation. Addressing the synthetic-real domain gap
inherent in end-to-end optimization, we propose the use of a real-world
photometric-stereo training dataset composed of 3D-printed objects. Moreover,
to mitigate the ill-posed nature of photometric stereo, we exploit the linearly
polarized light emitted from the monitor to optically separate diffuse and
specular reflections in the captured images. We demonstrate that DDPS allows
for learning display patterns optimized for a target configuration and is
robust to initialization. We assess DDPS on 3D-printed objects with
ground-truth normals and diverse real-world objects, validating that DDPS
enables effective photometric-stereo reconstruction.
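For context, the classical Lambertian photometric-stereo solve that underlies any such pipeline recovers per-pixel normals by least squares; a textbook sketch follows (DDPS itself goes further, learning display patterns and optically separating specular reflections).

```python
import numpy as np

def lambertian_ps(I, L):
    """Per-pixel normals from classical photometric stereo.

    I: (K, H, W) grayscale images under K known illuminations
    L: (K, 3) unit lighting directions
    Solves I = L @ (albedo * n) per pixel in the least-squares sense.
    """
    K, H, W = I.shape
    b = np.linalg.lstsq(L, I.reshape(K, -1), rcond=None)[0]  # (3, H*W)
    albedo = np.linalg.norm(b, axis=0)
    n = b / (albedo + 1e-9)                                  # unit normals
    return n.reshape(3, H, W), albedo.reshape(H, W)
```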
An Explicit Method for Fast Monocular Depth Recovery in Corridor Environments
Monocular cameras are extensively employed in indoor robotics, but their
performance is limited in visual odometry, depth estimation, and related
applications due to the absence of scale information. Depth estimation refers to
the process of estimating a dense depth map from the corresponding input image.
Existing research mostly addresses this problem with deep learning-based
approaches, yet their inference speed is slow, leading to poor real-time
performance. To tackle this challenge, we propose an explicit method for rapid
monocular depth recovery specifically designed for corridor environments,
leveraging the principles of nonlinear optimization. We adopt the virtual
camera assumption to make full use of the prior geometric features of the
scene. The depth estimation problem is transformed into an optimization problem
by minimizing the geometric residual. Furthermore, a novel depth plane
construction technique is introduced to categorize spatial points based on
their possible depths, facilitating swift depth estimation in enclosed
structural scenarios, such as corridors. We also propose a new corridor
dataset, named Corr_EH_z, which contains images of a variety of corridors
captured by a UGV camera. An exhaustive set of experiments in different
corridors reveals the efficacy of the proposed algorithm.
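To make the geometric-residual idea concrete, the sketch below fits a single wall plane to back-projected rays with scipy's least-squares solver; the residual definition, intrinsics, and anchor points are hypothetical stand-ins for the paper's formulation.

```python
import numpy as np
from scipy.optimize import least_squares

def ray(u, v, K_inv):
    """Unit-norm back-projected ray for pixel (u, v)."""
    r = K_inv @ np.array([u, v, 1.0])
    return r / np.linalg.norm(r)

def residual(params, rays, anchors):
    """Geometric residual: distance of hypothesized plane intersections
    from known anchor points (illustrative formulation).
    params = (nx, ny, nz, d): plane normal (unnormalized) and offset."""
    n = params[:3] / np.linalg.norm(params[:3])
    depths = params[3] / (rays @ n)        # depth where each ray meets plane
    pts = rays * depths[:, None]           # intersection points
    return (pts - anchors).ravel()

# Hypothetical data: camera intrinsics and a few anchor points on one wall.
K_inv = np.linalg.inv(np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1.0]]))
rays = np.stack([ray(u, v, K_inv) for u, v in [(100, 120), (400, 130), (250, 300)]])
anchors = rays * np.array([2.0, 2.1, 1.9])[:, None]   # fake ground truth

fit = least_squares(residual, x0=np.array([0, 0, 1.0, 2.0]), args=(rays, anchors))
n_hat = fit.x[:3] / np.linalg.norm(fit.x[:3])
print("estimated plane:", n_hat, fit.x[3])
```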
A Neural Height-Map Approach for the Binocular Photometric Stereo Problem
In this work we propose a novel, highly practical, binocular photometric
stereo (PS) framework, which has the same acquisition speed as single-view PS
yet significantly improves the quality of the estimated geometry.
As in recent neural multi-view shape estimation frameworks such as NeRF and
SIREN, and in inverse-graphics approaches to multi-view photometric stereo (e.g.
PS-NeRF), we formulate the shape estimation task as learning a differentiable
surface and texture representation, minimising the discrepancy between surface
normals estimated from multiple varying-light images for the two views, as well
as the discrepancy between rendered surface intensity and observed images. Our method
differs from typical multi-view shape estimation approaches in two key ways.
First, our surface is represented not as a volume but as a neural heightmap
where heights of points on a surface are computed by a deep neural network.
Second, instead of predicting an average intensity as PS-NeRF does or assuming
Lambertian materials as Guo et al. do, we use a learnt BRDF and perform
near-field per point intensity rendering.
Our method achieves state-of-the-art performance on the DiLiGenT-MV
dataset adapted to the binocular stereo setup, as well as on a new binocular
photometric stereo dataset, LUCES-ST.
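A minimal sketch of the neural height-map representation: an MLP maps surface coordinates (x, y) to a height, and normals follow from the analytic gradient via autograd. The architecture and shapes are our schematic assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class HeightMap(nn.Module):
    """MLP mapping 2D surface coordinates to a scalar height."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xy):
        return self.net(xy).squeeze(-1)

def normals_from_height(model, xy):
    """Surface normal of z = h(x, y) is (-dh/dx, -dh/dy, 1), normalized."""
    xy = xy.clone().requires_grad_(True)
    h = model(xy)
    grad = torch.autograd.grad(h.sum(), xy, create_graph=True)[0]
    n = torch.cat([-grad, torch.ones_like(h)[..., None]], dim=-1)
    return n / n.norm(dim=-1, keepdim=True)

model = HeightMap()
xy = torch.rand(1024, 2)            # sampled surface points
n = normals_from_height(model, xy)  # (1024, 3) unit normals
# Training would minimise the discrepancy between n and the PS-estimated
# normals for both views, plus a rendered-vs-observed intensity loss.
```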
Sonic heritage: listening to the past
History is so often told through objects, images and photographs, but the potential of sounds to reveal place and space is often neglected. Our research project 'Sonic Palimpsest' explores the potential of sound to evoke impressions and new understandings of the past, to embrace the sonic as a tool to understand what was, in a way that can complement and add to our predominant visual understandings. Our work includes the expansion of the Oral History archives held at Chatham Dockyard to include women's voices and experiences, and the creation of sonic works to engage the public with their heritage. Our research highlights the social and cultural value of oral history and field recordings in the transmission of knowledge to both researchers and the public. Together these recordings document how buildings and spaces within the dockyard were used and experienced by those who worked there. We can begin to understand the social and cultural roles of these buildings within the community, both past and present.
Visual Guidance for Unmanned Aerial Vehicles with Deep Learning
Unmanned Aerial Vehicles (UAVs) have been widely applied in the military and civilian domains. In recent years, the operation mode of UAVs has been evolving from teleoperation to autonomous flight. To achieve autonomous flight, a reliable guidance system is essential. Since the combination of the Global Positioning System (GPS) and an Inertial Navigation System (INS) cannot sustain autonomous flight in situations where GPS is degraded or unavailable, using computer vision as a primary method for UAV guidance has been widely explored. Moreover, GPS does not provide the robot with any information about the presence of obstacles.
Stereo cameras have a complex architecture and need a minimum baseline to generate a disparity map. By contrast, monocular cameras are simple and require fewer hardware resources. Benefiting from state-of-the-art Deep Learning (DL) techniques, especially Convolutional Neural Networks (CNNs), a monocular camera is sufficient to extract mid-level visual representations such as depth maps and optical flow (OF) maps from the environment. Therefore, the objective of this thesis is to develop a real-time visual guidance method for UAVs in cluttered environments using a monocular camera and DL.
The three major tasks performed in this thesis are investigating the development of DL techniques and monocular depth estimation (MDE), developing real-time CNNs for MDE, and developing visual guidance methods on the basis of the developed MDE system. A comprehensive survey is conducted, which covers Structure from Motion (SfM)-based methods, traditional handcrafted feature-based methods, and state-of-the-art DL-based methods. More importantly, it also investigates the application of MDE in robotics. Based on the survey, two CNNs for MDE are developed. In addition to promising accuracy, these two CNNs run at high frame rates (126 fps and 90 fps respectively) on a single modest-power Graphics Processing Unit (GPU).
As regards the third task, visual guidance for UAVs is first developed on top of the designed MDE networks. To improve the robustness of UAV guidance, OF maps are integrated into the developed visual guidance method. A cross-attention module is applied to fuse the features learned from the depth maps and OF maps. The fused features are then passed through a deep reinforcement learning (DRL) network to generate the policy for guiding the flight of the UAV. Additionally, a simulation framework is developed which integrates AirSim, Unreal Engine, and PyTorch. The effectiveness of the developed visual guidance method is validated through extensive experiments in the simulation framework.
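A minimal sketch of the cross-attention fusion of depth and optical-flow features described above; the single module, token shapes, and dimensions are our assumptions rather than the thesis's exact design.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse depth-map features with optical-flow features: depth tokens
    attend to flow tokens (and vice versa), then the two are concatenated."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.d2f = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.f2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, depth_feat, flow_feat):
        # depth_feat, flow_feat: (B, N, dim) token sequences
        d, _ = self.d2f(depth_feat, flow_feat, flow_feat)   # depth attends to flow
        f, _ = self.f2d(flow_feat, depth_feat, depth_feat)  # flow attends to depth
        return self.proj(torch.cat([d, f], dim=-1))         # (B, N, dim), to DRL policy

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
```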
Neural Reflectance Decomposition
Creating relightable objects from images or collections is a fundamental challenge in computer vision and graphics. This problem is also known as inverse rendering. One of the main challenges in this task is its high ambiguity. The creation of images from 3D objects is well defined as rendering. However, multiple properties such as shape, illumination, and surface reflectiveness influence each other, and an integration of these influences is performed to form the final image. Reversing these integrated dependencies is highly ill-posed and ambiguous. However, solving the task is essential, as the automated creation of relightable objects has various applications in online shopping, augmented reality (AR), virtual reality (VR), games, and movies.
In this thesis, we propose two approaches to solve this task. First, a network architecture is discussed that generalizes the decomposition of a two-shot capture of an object from large training datasets. The degree of novel view synthesis is limited, as only a single perspective is used in the decomposition. Therefore, a second set of approaches is proposed, which decomposes a set of 360-degree images. These multi-view images are optimized per object, and the result can be directly used in standard rendering software or games. We achieve this by extending recent research on Neural Fields, which can store information in a 3D neural volume. Leveraging volume rendering techniques, we can optimize a reflectance field from in-the-wild image collections without any ground truth (GT) supervision.
Our proposed methods achieve state-of-the-art decomposition quality and enable novel capture setups where objects can be under varying illumination or in different locations, which is typical for online image collections.
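The "integration of influences" that makes inverse rendering ambiguous is the direct-illumination rendering equation (textbook form, not notation from the thesis): observed radiance $L_o$ entangles reflectance $f_r$ (the BRDF), illumination $L_i$, and shape through the normal $\mathbf{n}$,

```latex
L_o(\mathbf{x}, \omega_o) = \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o) \, L_i(\mathbf{x}, \omega_i) \, (\mathbf{n} \cdot \omega_i) \, \mathrm{d}\omega_i
```

Decomposition must disentangle all three factors from $L_o$ alone, which is why multi-view, multi-illumination image collections make the problem more tractable.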