Impact of Imaging and Distance Perception in VR Immersive Visual Experience
Virtual reality (VR) headsets have evolved to include unprecedented viewing quality. Meanwhile, they have become lightweight, wireless, and low-cost, which has opened them up to new applications and a much wider audience. VR headsets can now provide users with greater understanding of events and accuracy of observation, making decision-making faster and more effective. However, immersive technologies have seen slow take-up, with the adoption of virtual reality limited to a few applications, typically related to entertainment. This reluctance appears to be due to the often-necessary change of operating paradigm and some scepticism towards the "VR advantage". The need therefore arises to evaluate the contribution that a VR system can make to user performance, for example in monitoring and decision-making. This will help system designers understand when immersive technologies can be proposed to replace or complement standard display systems such as a desktop monitor.
In parallel with the evolution of VR headsets, 360° cameras have advanced and are now capable of instantly acquiring photographs and videos in stereoscopic 3D (S3D) at very high resolutions. 360° images are innately suited to VR headsets, where the captured view can be observed and explored through the natural rotation of the head. Acquired views can even be experienced and navigated from the inside as they are captured.
The combination of omnidirectional images and VR headsets has opened up a new way of creating immersive visual representations. We call it photo-based VR. This represents a new methodology that combines traditional model-based rendering with high-quality omnidirectional texture-mapping. Photo-based VR is particularly suitable for applications related to remote visits and realistic scene reconstruction, useful for monitoring and surveillance systems, control panels and operator training.
The presented PhD study investigates the potential of photo-based VR representations. It starts by evaluating the role of immersion and user performance in today's graphical visual experience, and then uses this as a reference to develop and evaluate new photo-based VR solutions. With the current literature on photo-based VR experience and associated user performance being very limited, this study builds new knowledge from the proposed assessments.
We conduct five user studies on a few representative applications, examining how visual representations can be affected by system factors (camera and display related) and how they can influence human factors (such as realism, presence, and emotions). Particular attention is paid to realistic depth perception, in support of which we develop targeted solutions for photo-based VR. They are intended to provide users with a correct perception of space dimensions and object size. We call it true-dimensional visualization.
The presented work contributes to unexplored fields including photo-based VR and true-dimensional visualization, offering immersive system designers a thorough comprehension of the benefits, potential, and type of applications in which these new methods can make the difference.
This thesis manuscript and its findings have been partly presented in scientific publications. In particular, five conference papers in Springer proceedings and IEEE symposia, [1], [2], [3], [4], [5], and one journal article in an IEEE periodical [6], have been published.
DILF: Differentiable Rendering-Based Multi-View Image-Language Fusion for Zero-Shot 3D Shape Understanding
Zero-shot 3D shape understanding aims to recognize "unseen" 3D categories that are not present in training data. Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising open-world performance in zero-shot 3D shape understanding tasks by fusing information across the language and 3D modalities. It first renders 3D objects into multiple 2D image views and then learns to understand the semantic relationships between the textual descriptions and images, enabling the model to generalize to new and unseen categories. However, existing studies in zero-shot 3D shape understanding rely on predefined rendering parameters, resulting in repetitive, redundant, and low-quality views. This limitation hinders the model's ability to fully comprehend 3D shapes and adversely impacts the text-image fusion in a shared latent space. To this end, we propose a novel approach called Differentiable rendering-based multi-view Image-Language Fusion (DILF) for zero-shot 3D shape understanding. Specifically, DILF leverages large-scale language models (LLMs) to generate textual prompts enriched with 3D semantics and designs a differentiable renderer with learnable rendering parameters to produce representative multi-view images. These rendering parameters can be iteratively updated using a text-image fusion loss, which aids the regression of the parameters, allowing the model to determine the optimal viewpoint positions for each 3D object. Then a group-view mechanism is introduced to model interdependencies across views, enabling efficient information fusion to achieve a more comprehensive 3D shape understanding. Experimental results demonstrate that DILF outperforms state-of-the-art methods for zero-shot 3D classification while maintaining competitive performance for standard 3D classification. The code is available at https://github.com/yuzaiyang123/DILP
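The zero-shot step described in this abstract (embed several rendered views, embed one text prompt per category, pick the closest category in the shared space) can be illustrated with a minimal sketch. This is not the DILF implementation: the embeddings below are hypothetical toy vectors rather than CLIP outputs, and plain mean pooling stands in for the group-view mechanism, which the abstract does not specify in detail.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def zero_shot_classify(view_embeddings, text_embeddings):
    """Fuse per-view embeddings by mean pooling (an assumption, not the
    paper's group-view mechanism) and return the best-matching label
    together with all similarity scores."""
    dim = len(view_embeddings[0])
    count = len(view_embeddings)
    fused = [sum(view[i] for view in view_embeddings) / count for i in range(dim)]
    scores = {label: cosine(fused, emb) for label, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical 3-dimensional embeddings for two rendered views of one object
# and for two category prompts; real embeddings would come from an encoder.
views = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
prompts = {"chair": [1.0, 0.0, 0.0], "lamp": [0.0, 1.0, 0.0]}
label, scores = zero_shot_classify(views, prompts)
print(label, scores)
```

Because no category-specific training is involved, adding an unseen category only requires adding another entry to `prompts`.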
Self-supervised learning for transferable representations
Machine learning has undeniably achieved remarkable advances thanks to large labelled datasets and supervised learning. However, this progress is constrained by the labour-intensive annotation process. It is not feasible to generate extensive labelled datasets for every problem we aim to address. Consequently, there has been a notable shift in recent times toward approaches that solely leverage raw data. Among these, self-supervised learning has emerged as a particularly powerful approach, offering scalability to massive datasets and showcasing considerable potential for effective knowledge transfer. This thesis investigates self-supervised representation learning with a strong focus on computer vision applications. We provide a comprehensive survey of self-supervised methods across various modalities, introducing a taxonomy that categorises them into four distinct families while also highlighting practical considerations for real-world implementation. Our focus thenceforth is on the computer vision modality, where we perform a comprehensive benchmark evaluation of state-of-the-art self-supervised models against many diverse downstream transfer tasks. Our findings reveal that self-supervised models often outperform supervised learning across a spectrum of tasks, albeit with correlations weakening as tasks transition beyond classification, particularly for datasets with distribution shifts. Digging deeper, we investigate the influence of data augmentation on the transferability of contrastive learners, uncovering a trade-off between spatial and appearance-based invariances that generalise to real-world transformations. This begins to explain the differing empirical performances achieved by self-supervised learners on different downstream tasks, and it showcases the advantages of specialised representations produced with tailored augmentation.
Finally, we introduce a novel self-supervised pre-training algorithm for object detection, aligning pre-training with downstream architecture and objectives, leading to reduced localisation errors and improved label efficiency. In conclusion, this thesis contributes a comprehensive understanding of self-supervised representation learning and its role in enabling effective transfer across computer vision tasks.
Volumetric Occupancy Detection: A Comparative Analysis of Mapping Algorithms
Despite the growing interest in innovative functionalities for collaborative
robotics, volumetric detection remains indispensable for ensuring basic
security. However, there is a lack of widely used volumetric detection
frameworks specifically tailored to this domain, and existing evaluation
metrics primarily focus on time and memory efficiency. To bridge this gap, the
authors present a detailed comparison using a simulation environment, ground
truth extraction, and automated evaluation metrics calculation. This enables
the evaluation of state-of-the-art volumetric mapping algorithms, including
OctoMap, SkiMap, and Voxblox, providing insights through both qualitative and
quantitative analyses. The study not only compares different frameworks but
also explores various parameters within each framework, offering additional
insights into their performance.
Comment: 11 pages, 11 figures, 9 table
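Comparing a reconstructed occupancy map against simulated ground truth can be sketched as follows. The abstract does not name the metrics it computes, so the precision, recall, and intersection-over-union scores below are assumptions chosen for illustration, and `voxelize` and `occupancy_scores` are hypothetical helper names, not functions from OctoMap, SkiMap, or Voxblox.

```python
import math


def voxelize(points, resolution):
    """Map 3D points (in metres) to the set of integer voxel indices they occupy."""
    return {tuple(math.floor(c / resolution) for c in p) for p in points}


def occupancy_scores(predicted, truth):
    """Compare two sets of occupied voxels.

    Returns (precision, recall, iou). These metrics are an illustrative
    assumption; the study may use different ones.
    """
    true_pos = len(predicted & truth)
    union = len(predicted | truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    iou = true_pos / union if union else 1.0
    return precision, recall, iou


# Toy example: three ground-truth voxels along x, one of which the map misplaces.
truth_pts = [(0.05, 0.05, 0.05), (0.15, 0.05, 0.05), (0.25, 0.05, 0.05)]
map_pts = [(0.05, 0.05, 0.05), (0.15, 0.05, 0.05), (0.35, 0.05, 0.05)]
res = 0.1  # voxel edge length in metres
print(occupancy_scores(voxelize(map_pts, res), voxelize(truth_pts, res)))
```

Rerunning the same comparison at several values of `res` is one simple way to explore a parameter within a single framework.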
Scalable Exploration of Complex Objects and Environments Beyond Plain Visual Replication
Digital multimedia content and presentation means are rapidly increasing their sophistication and are now capable of describing detailed representations of the physical world. 3D exploration experiences allow people to appreciate, understand and interact with intrinsically virtual objects.
Communicating information on objects requires the ability to explore them from different angles, as well as to mix highly photorealistic or illustrative presentations of the objects themselves with additional data that provides further insight into them, typically represented in the form of annotations. Effectively providing these capabilities requires solving important problems in visualization and user interaction.
In this thesis, I studied these problems in the cultural heritage computing domain, focusing on the very common and important special case of mostly planar, but visually, geometrically, and semantically rich objects. These could be roughly flat objects with a standard frontal viewing direction (e.g., paintings, inscriptions, bas-reliefs), as well as visualizations of fully 3D objects from a particular point of view (e.g., canonical views of buildings or statues). Selecting a precise application domain and a specific presentation mode allowed me to concentrate on the well-defined use case of the exploration of annotated relightable stratigraphic models (in particular, for local and remote museum presentation).
My main results and contributions to the state of the art have been a novel technique for interactively controlling visualization lenses while automatically maintaining good focus-and-context parameters, a novel approach for avoiding clutter in an annotated model and for guiding users towards interesting areas, and a method for structuring audio-visual object annotations into a graph and for using that graph to improve guidance and support storytelling and automated tours.
We demonstrated the effectiveness and potential of our techniques by performing interactive exploration sessions on various screen sizes and types ranging from desktop devices to large-screen displays for a walk-up-and-use museum installation.
KEYWORDS - Computer Graphics, Human-Computer Interaction, Interactive Lenses, Focus-and-Context, Annotated Models, Cultural Heritage Computing
Influence of Bed Roughness on Flow and Turbulence Structure Around a Partially-Buried, Isolated Freshwater Mussel
The present study uses eddy-resolving numerical simulations to investigate how bed roughness affects flow and turbulence structure around an isolated, partially-buried mussel (Unio elongatulus) aligned with the incoming flow. The rough-bed simulations resolve the flow past the exposed part of a gravel bed, whose surface is obtained from a laboratory experiment that also provides some additional data for validation of the numerical model. Results are also discussed for the limiting case of a horizontal smooth bed. Additionally, the effects of varying the level of burial of the mussel inside the substrate and the discharge through the two mussel siphons are investigated via a set of simulations in which the ratio between the median diameter of the (gravel) particles forming the rough bed, d50, and the height of the exposed part of the mussel, h, varies between 0.10 and 0.22. The increase of the bed roughness is associated with a strong amplification of the turbulence kinetic energy in the near-wake region. Increasing the bed roughness and/or reducing h intensifies the interactions of the eddies generated by the bed particles with the base and tip vortices induced by the active filtering and by the mussel shell, respectively, which, in turn, induces a more rapid dissipation of these vortices. Increasing the bed roughness also reduces the strength of the main downwelling flow region forming in the wake. The strong downwelling near the symmetry plane is the main reason why the symmetric wake shedding mode dominates in the smooth bed simulations with negligible active filtering. By contrast, the anti-symmetric wake shedding mode dominates in the simulations conducted with a high value of the bed roughness. The mean streamwise drag force coefficient for the emerged part of the shell and the dilution of the excurrent siphon jet increase with increasing bed roughness.
Emerging Approaches for THz Array Imaging: A Tutorial Review and Software Tool
Accelerated by the increasing attention drawn by 5G, 6G, and Internet of
Things applications, communication and sensing technologies have rapidly
evolved from millimeter-wave (mmWave) to terahertz (THz) in recent years.
Enabled by significant advancements in electromagnetic (EM) hardware, mmWave
and THz frequency regimes spanning 30 GHz to 300 GHz and 300 GHz to 3000 GHz,
respectively, can be employed for a host of applications. The main feature of
THz systems is high-bandwidth transmission, enabling ultra-high-resolution
imaging and high-throughput communications; however, challenges in both the
hardware and algorithmic arenas remain for the ubiquitous adoption of THz
technology. Spectra comprising mmWave and THz frequencies are well-suited for
synthetic aperture radar (SAR) imaging at sub-millimeter resolutions for a wide
spectrum of tasks like material characterization and nondestructive testing
(NDT). This article provides a tutorial review of systems and algorithms for
THz SAR in the near-field with an emphasis on emerging algorithms that combine
signal processing and machine learning techniques. As part of this study, an
overview of classical and data-driven THz SAR algorithms is provided, focusing
on object detection for security applications and SAR image super-resolution.
We also discuss relevant issues, challenges, and future research directions for
emerging algorithms and THz SAR, including standardization of system and
algorithm benchmarking, adoption of state-of-the-art deep learning techniques,
signal processing-optimized machine learning, and hybrid data-driven signal
processing algorithms...
Comment: Submitted to Proceedings of IEE
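The link between bandwidth and resolution mentioned in this abstract can be made concrete with the standard radar range-resolution relation, delta_r = c / (2B). The sketch below is a back-of-the-envelope illustration only; the bandwidth values are hypothetical examples and are not taken from the article or its software tool.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def range_resolution_m(bandwidth_hz):
    """Ideal range resolution of a radar with the given bandwidth: c / (2B)."""
    return C / (2.0 * bandwidth_hz)


def wavelength_m(frequency_hz):
    """Free-space wavelength at the given carrier frequency."""
    return C / frequency_hz


# Hypothetical bandwidths: a typical mmWave chirp and two THz-scale sweeps.
for bandwidth in (4e9, 100e9, 300e9):
    res_mm = range_resolution_m(bandwidth) * 1e3
    print(f"B = {bandwidth / 1e9:.0f} GHz -> range resolution ~ {res_mm:.2f} mm")

# Wavelength at the 300 GHz boundary between the mmWave and THz regimes.
print(f"wavelength at 300 GHz ~ {wavelength_m(300e9) * 1e3:.2f} mm")
```

Under this relation, sub-millimeter range resolution requires sweeps wider than roughly 150 GHz, which is why it becomes practical only as systems move into the THz regime.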
Semantic-aware Transmission for Robust Point Cloud Classification
As three-dimensional (3D) data acquisition devices become increasingly
prevalent, the demand for 3D point cloud transmission is growing. In this
study, we introduce a semantic-aware communication system for robust point
cloud classification that capitalizes on the advantages of pre-trained
Point-BERT models. Our proposed method comprises four main components: the
semantic encoder, channel encoder, channel decoder, and semantic decoder. By
employing a two-stage training strategy, our system facilitates efficient and
adaptable learning tailored to the specific classification tasks. The results
show that the proposed system achieves classification accuracy of over 89%
when SNR is higher than 10 dB and still maintains accuracy above 66.6% even at
SNR of 4 dB. Compared to the existing method, our approach performs 0.8% to
48% better across different SNR values, demonstrating robustness to channel
noise. Our system also achieves a balance between accuracy and speed, being
computationally efficient while maintaining high classification performance
under noisy channel conditions. This adaptable and resilient approach holds
considerable promise for a wide array of 3D scene understanding applications,
effectively addressing the challenges posed by channel noise.
Comment: submitted to globecom 202
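The SNR figures quoted in this abstract describe how noisy the channel is between the channel encoder and the channel decoder. As a minimal sketch, assuming an additive white Gaussian noise (AWGN) channel, which the abstract does not explicitly state, the following shows how a transmitted feature vector could be corrupted at a target SNR in dB. The function names are hypothetical and the signal is a toy sequence, not Point-BERT features.

```python
import math
import random


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, computed from the clean and the received signal."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_energy / noise_energy)


rng = random.Random(0)  # fixed seed for reproducibility
clean = [1.0 if i % 2 == 0 else -1.0 for i in range(20000)]  # toy symbols
for target in (4.0, 10.0):
    received = add_awgn(clean, target, rng)
    print(f"target {target} dB -> measured {measured_snr_db(clean, received):.2f} dB")
```

At 4 dB the noise power is about 40% of the signal power, which gives a sense of how degraded the received features are in the low-SNR regime the abstract reports on.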
Deceptive-NeRF: Enhancing NeRF Reconstruction using Pseudo-Observations from Diffusion Models
This paper introduces Deceptive-NeRF, a new method for enhancing the quality
of reconstructed NeRF models using synthetically generated pseudo-observations,
capable of handling sparse input and removing floater artifacts. Our proposed
method involves three key steps: 1) reconstruct a coarse NeRF model from sparse
inputs; 2) generate pseudo-observations based on the coarse model; 3) refine
the NeRF model using pseudo-observations to produce a high-quality
reconstruction. To generate photo-realistic pseudo-observations that faithfully
preserve the identity of the reconstructed scene while remaining consistent
with the sparse inputs, we develop a rectification latent diffusion model that
generates images conditional on a coarse RGB image and depth map, which are
derived from the coarse NeRF and latent text embedding from input images.
Extensive experiments show that our method is effective and can generate
perceptually high-quality NeRF even with very sparse inputs.
Enhancing Perception and Immersion in Pre-Captured Environments through Learning-Based Eye Height Adaptation
Pre-captured immersive environments using omnidirectional cameras provide a
wide range of virtual reality applications. Previous research has shown that
manipulating the eye height in egocentric virtual environments can
significantly affect distance perception and immersion. However, the influence
of eye height in pre-captured real environments has received less attention due
to the difficulty of altering the perspective after finishing the capture
process. To explore this influence, we first propose a pilot study that
captures real environments with multiple eye heights and asks participants to
judge the egocentric distances and immersion. If a significant influence is
confirmed, an effective image-based approach to adapt pre-captured real-world
environments to the user's eye height would be desirable. Motivated by the
study, we propose a learning-based approach for synthesizing novel views for
omnidirectional images with altered eye heights. This approach employs a
multitask architecture that learns depth and semantic segmentation in two
formats, and generates high-quality depth and semantic segmentation to
facilitate the inpainting stage. With the improved omnidirectional-aware
layered depth image, our approach synthesizes natural and realistic visuals for
eye height adaptation. Quantitative and qualitative evaluation shows favorable
results against state-of-the-art methods, and an extensive user study verifies
improved perception and immersion for pre-captured real-world environments.
Comment: 10 pages, 13 figures, 3 tables, submitted to ISMAR 202