Motion parallax for 360° RGBD video
We present a method for adding parallax and real-time playback of 360° videos in Virtual Reality headsets. In current video players, the playback does not respond to translational head movement, which reduces the feeling of immersion and causes motion sickness for some viewers. Given a 360° video and its corresponding depth (provided by current stereo 360° stitching algorithms), a naive image-based rendering approach would use the depth to generate a 3D mesh around the viewer, then translate it appropriately as the viewer moves their head. However, this approach breaks at depth discontinuities, showing visible distortions, whereas cutting the mesh at such discontinuities leads to ragged silhouettes and holes at disocclusions. We address these issues by improving the given initial depth map to yield cleaner, more natural silhouettes. We rely on a three-layer scene representation, made up of a foreground layer and two static background layers, to handle disocclusions by propagating information from multiple frames for the first background layer, and then inpainting for the second one. Our system works with input from many of today's most popular 360° stereo capture devices (e.g., Yi Halo or GoPro Odyssey), and works well even if the original video does not provide depth information. Our user studies confirm that our method provides a more compelling viewing experience than without parallax, increasing immersion while reducing discomfort and nausea.
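The naive depth-based approach mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the y-up coordinate convention, the pixel-centre sampling, and the relative threshold of 0.2 are all assumptions made for illustration. The sketch back-projects an equirectangular depth map to 3D points, re-expresses them for a displaced head position, and flags the depth discontinuities where a mesh built from neighbouring pixels would stretch.

```python
import math

def equirect_to_points(depth):
    """Back-project an equirectangular depth map (rows x cols) to 3D points.

    Assumed convention: y is up, longitude 0 looks along +z.
    """
    rows, cols = len(depth), len(depth[0])
    points = []
    for r in range(rows):
        # Latitude runs from +pi/2 (top row) to -pi/2 (bottom row).
        lat = math.pi / 2 - (r + 0.5) / rows * math.pi
        row_pts = []
        for c in range(cols):
            # Longitude runs from -pi to +pi across the image width.
            lon = (c + 0.5) / cols * 2 * math.pi - math.pi
            d = depth[r][c]
            row_pts.append((d * math.cos(lat) * math.sin(lon),
                            d * math.sin(lat),
                            d * math.cos(lat) * math.cos(lon)))
        points.append(row_pts)
    return points

def translate_for_head(points, head_offset):
    """Express the points relative to a viewer displaced by head_offset."""
    hx, hy, hz = head_offset
    return [[(x - hx, y - hy, z - hz) for (x, y, z) in row] for row in points]

def discontinuity_mask(depth, rel_threshold=0.2):
    """Flag pixels whose right-hand neighbour (wrapping in longitude)
    differs strongly in depth; rel_threshold is a hypothetical tuning value."""
    rows, cols = len(depth), len(depth[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            a, b = depth[r][c], depth[r][(c + 1) % cols]
            mask[r][c] = abs(a - b) > rel_threshold * min(a, b)
    return mask
```

Triangles spanning a flagged pixel pair are the ones that either stretch visibly (if kept) or leave holes (if cut), which is the trade-off the abstract describes.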
A Novel Inpainting Framework for Virtual View Synthesis
Multi-view imaging has stimulated significant research to enhance the user experience of free viewpoint video, allowing interactive navigation between views and the freedom to select a desired view to watch. This usually involves transmitting both textural and depth information captured from different viewpoints to the receiver, to enable the synthesis of an arbitrary view. In rendering these virtual views, perceptual holes can appear due to certain regions, hidden in the original view by a closer object, becoming visible in the virtual view. To provide a high quality experience these holes must be filled in a visually plausible way, in a process known as inpainting. This is challenging because the missing information is generally unknown and the hole-regions can be large. Recently depth-based inpainting techniques have been proposed to address this challenge and while these generally perform better than non-depth assisted methods, they are not very robust and can produce perceptual artefacts.
This thesis presents a new inpainting framework that innovatively exploits depth and textural self-similarity characteristics to construct subjectively enhanced virtual viewpoints. The framework makes three significant contributions to the field: i) the exploitation of view information to jointly inpaint textural and depth hole regions; ii) the introduction of the novel concept of self-similarity characterisation which is combined with relevant depth information; and iii) an advanced self-similarity characterising scheme that automatically determines key spatial transform parameters for effective and flexible inpainting.
The presented inpainting framework has been critically analysed and shown to provide superior performance both perceptually and numerically compared to existing techniques, especially in terms of fewer visual artefacts. It provides a flexible, robust framework to develop new inpainting strategies for the next generation of interactive multi-view technologies.
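To make the origin of the holes described in this abstract concrete, here is a minimal sketch of forward warping a single scanline to a virtual view. It assumes a simplified setup that the abstract does not specify: a purely horizontal camera shift, per-pixel disparity (larger means closer), and a hypothetical function name `warp_scanline`. It shows only how disocclusion holes arise, not the thesis's inpainting method.

```python
def warp_scanline(colors, disparity, shift):
    """Forward-warp one scanline to a virtual view.

    Each source pixel moves by round(shift * disparity). When two pixels
    land on the same target, the nearer one (larger disparity) wins.
    Targets that receive no pixel stay None: these are disocclusion holes.
    """
    width = len(colors)
    target = [None] * width
    target_disp = [float("-inf")] * width
    for x in range(width):
        nx = x + round(shift * disparity[x])
        if 0 <= nx < width and disparity[x] > target_disp[nx]:
            target[nx] = colors[x]
            target_disp[nx] = disparity[x]
    return target

# Background 'B' (disparity 0) with a foreground object 'F' (disparity 2).
row = warp_scanline(list("BBFFBBBB"), [0, 0, 2, 2, 0, 0, 0, 0], 1)
print(row)  # ['B', 'B', None, None, 'F', 'F', 'B', 'B']
```

The two `None` entries mark background that was hidden behind the foreground object in the source view; an inpainting step would have to fill them with plausible texture and depth.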
Reimagining Reality: A Comprehensive Survey of Video Inpainting Techniques
This paper offers a comprehensive analysis of recent advancements in video
inpainting techniques, a critical subset of computer vision and artificial
intelligence. As a process that restores or fills in missing or corrupted
portions of video sequences with plausible content, video inpainting has
evolved significantly with the advent of deep learning methodologies. Despite
the plethora of existing methods and their swift development, the landscape
remains complex, posing challenges to both novices and established researchers.
Our study deconstructs major techniques, their underpinning theories, and their
effective applications. Moreover, we conduct an exhaustive comparative study,
centering on two often-overlooked dimensions: visual quality and computational
efficiency. We adopt a human-centric approach to assess visual quality,
enlisting a panel of annotators to evaluate the output of different video
inpainting techniques. This provides a nuanced qualitative understanding that
complements traditional quantitative metrics. Concurrently, we delve into the
computational aspects, comparing inference times and memory demands across a
standardized hardware setup. This analysis underscores the balance between
quality and efficiency: a critical consideration for practical applications
where resources may be constrained. By integrating human validation and
computational resource comparison, this survey not only clarifies the present
landscape of video inpainting techniques but also charts a course for future
explorations in this vibrant and evolving field.
Quasi-Modal Encounters Of The Third Kind: The Filling-In Of Visual Detail
Although Pessoa et al. imply that many aspects of the filling-in debate may be displaced by a regard for active vision, they remain loyal to naive neural reductionist explanations of certain pieces of psychophysical evidence. Alternative interpretations are provided for two specific examples, and a new category of filling-in (of visual detail) is proposed.
Completing unknown portions of 3D scenes by 3D visual propagation
Institute of Perception, Action and Behaviour
As the requirement for more realistic 3D environments is pushed forward by the computer {graphics | movie | simulation | games} industry, attention turns away from the creation of purely synthetic, artist-derived environments towards the use of real-world captures from the 3D world in which we live.
However, common 3D acquisition techniques, such as laser scanning and stereo capture, are realistically only 2.5D in nature, such that the backs and occluded portions of objects cannot be realised from a single uni-directional viewpoint. Although multi-directional capture has existed for some time, this incurs additional temporal and computational cost with no existing guarantee that the resulting acquisition will be free of minor holes, missing surfaces and the like.
Drawing inspiration from the study of human abilities in 3D visual completion, we consider the automated completion of these hidden or missing portions in 3D scenes originally acquired from 2.5D (or 3D) capture. We propose an approach based on the visual propagation of available scene knowledge from the known (visible) scene areas to these unknown (invisible) 3D regions (i.e. the completion of unknown volumes via visual propagation - the concept of volume completion).
Our proposed approach uses a combination of global surface fitting, to derive an initial underlying geometric surface completion, together with a 3D extension of nonparametric texture synthesis in order to provide the propagation of localised structural 3D surface detail (i.e. surface relief). We further extend our technique both to the combined completion of 3D surface relief and colour and additionally to hierarchical surface completion that offers both improved structural results and computational efficiency gains over our initial non-hierarchical technique.
To validate the success of these approaches we present the completion and extension of numerous 2.5D (and 3D) surface examples with relief ranging over natural, man-made, stochastic, regular and irregular forms. These results are evaluated both subjectively within our definition of plausible completion and quantitatively by statistical analysis in the geometric and colour domains.
Visual analysis and synthesis with physically grounded constraints
The past decade has witnessed remarkable progress in image-based, data-driven vision and graphics. However, existing approaches often treat images as pure 2D signals and not as 2D projections of the physical 3D world. As a result, many training examples are required to cover sufficiently diverse appearances, and the approaches inevitably suffer from limited generalization capability. In this thesis, I propose "inference-by-composition" approaches to overcome these limitations by modeling and interpreting visual signals in terms of physical surface, object, and scene. I show how we can incorporate physically grounded constraints such as scene-specific geometry in a non-parametric optimization framework for (1) revealing the missing parts of an image due to removal of a foreground or background element, and (2) recovering high spatial frequency details that are not resolvable in low-resolution observations. I then extend the framework from 2D images to handle spatio-temporal visual data (videos). I demonstrate that we can convincingly fill spatio-temporal holes in a temporally coherent fashion by jointly reconstructing the appearance and motion. Compared to existing approaches, our technique can synthesize physically plausible contents even in challenging videos. For visual analysis, I apply stereo camera constraints for discovering multiple approximately linear structures in extremely noisy videos, with an ecological application to bird migration monitoring at night. The resulting algorithms are simple and intuitive while achieving state-of-the-art performance without the need for training on an exhaustive set of visual examples.
Advanced editing methods for image and video sequences
In the context of image and video editing, this thesis proposes methods for modifying the semantic content of a recorded scene. Two different editing problems are approached: first, the removal of ghosting artifacts from high dynamic range (HDR) images recovered from exposure sequences, and second, the removal of objects from video sequences recorded with and without camera motion. These edits need to be performed in such a way that the result looks plausible to humans, but without having to recover detailed models of the content of the scene, e.g. its geometry, reflectance, or illumination. The proposed editing methods add new key ingredients, such as camera noise models and global optimization frameworks, that help achieve results that surpass the capabilities of state-of-the-art methods. Using these ingredients, each proposed method defines local visual properties that approximate well the specific editing requirements of each task. These properties are then encoded into an energy function that, when globally minimized, produces the required editing results. The optimization of such energy functions corresponds to Bayesian inference problems that are solved efficiently using graph cuts. The proposed methods are demonstrated to outperform other state-of-the-art methods. Furthermore, they are demonstrated to work well on complex real-world scenarios that have not been previously addressed in the literature, i.e., highly cluttered scenes for HDR deghosting, and highly dynamic scenes and unconstrained camera motion for object removal from videos.
Fusing spatial and temporal components for real-time depth data enhancement of dynamic scenes
The depth images from consumer depth cameras (e.g., structured-light/ToF devices) exhibit a substantial number of artifacts (e.g., holes, flickering, ghosting) that need to be removed for real-world applications. Existing methods cannot entirely remove them and are slow. This thesis proposes a new real-time spatio-temporal depth image enhancement filter that completely removes flickering and ghosting, and significantly reduces holes. This thesis also presents a novel depth-data capture setup and two data reduction methods to optimize the performance of the proposed enhancement method.
Pathological completion: The blind leading the mind?
The taxonomy proposed by Pessoa et al. should be extended to include "pathological" completion phenomena in patients with unilateral brain damage. Patients with visual field defects (hemianopias) may "complete" whole figures, while patients with parietal lobe damage may "complete" partial figures. We argue that the former may be consistent with the brain "filling-in" information, and the latter may be consistent with the brain ignoring the absence of information.
A Unified Cognitive Model of Visual Filling-In Based on an Emergic Network Architecture
The Emergic Cognitive Model (ECM) is a unified computational model of visual filling-in based on the Emergic Network architecture. The Emergic Network was designed to help realize systems undergoing continuous change. In this thesis, eight different filling-in phenomena are demonstrated under a regime of continuous eye movement (and under static eye conditions as well).
ECM indirectly demonstrates the power of unification inherent with Emergic Networks when cognition is decomposed according to finer-grained functions supporting change. These can interact to raise additional emergent behaviours via cognitive re-use, hence the Emergic prefix throughout. Nevertheless, the model is robust and parameter free. Differential re-use occurs in the nature of model interaction with a particular testing paradigm.
ECM has a novel decomposition due to the requirements of handling motion and of supporting unified modelling via finer functional grains. The breadth of phenomenal behaviour covered is largely to lend credence to our novel decomposition.
The Emergic Network architecture is a hybrid between classical connectionism and classical computationalism that facilitates the construction of unified cognitive models. It helps cut functionalism into finer grains distributed over space (by harnessing massive recurrence) and over time (by harnessing continuous change), yet simplifies by using standard computer code to focus on the interaction of information flows. Thus, while the structure of the network looks neurocentric, the dynamics are best understood in flowcentric terms. Surprisingly, dynamic system analysis (as usually understood) is not involved. An Emergic Network is engineered much like straightforward software or hardware systems that deal with continuously varying inputs. Ultimately, this thesis addresses the problem of reduction and induction over complex systems, and the Emergic Network architecture is merely a tool to assist in this epistemic endeavour.
ECM is strictly a sensory model and apart from perception, yet it is informed by phenomenology. It addresses the attribution problem of how much of a phenomenon is best explained at a sensory level of analysis, rather than at a perceptual one. As the causal information flows are stable under eye movement, we hypothesize that they are the locus of consciousness, howsoever it is ultimately realized.