259 research outputs found
Stereoscopic video quality assessment using binocular energy
Stereoscopic imaging is becoming increasingly popular. However, to ensure the best quality of experience, there is a need to develop more robust and accurate objective metrics for stereoscopic content quality assessment. Existing stereoscopic image and video metrics are either extensions of conventional 2D
metrics (with added depth or disparity information) or are based on relatively simple perceptual models. Consequently, they tend to lack the accuracy and robustness required for stereoscopic content quality assessment. This paper introduces full-reference stereoscopic image and video quality metrics based on a Human
Visual System (HVS) model incorporating important physiological findings on binocular vision. The proposed approach is based on the following three contributions. First, it introduces a novel HVS model extending previous models to include the phenomena of binocular suppression and recurrent excitation. Second, an image quality metric based on the novel HVS model
is proposed. Finally, an optimised temporal pooling strategy is introduced to extend the metric to the video domain. Both image and video quality metrics are obtained via a training procedure to establish a relationship between subjective scores and objective measures of the HVS model. The metrics are evaluated using
publicly available stereoscopic image/video databases as well as a new stereoscopic video database. An extensive experimental evaluation demonstrates the robustness of the proposed quality metrics and indicates a considerable improvement over the state of the art, with average correlations with subjective
scores of 0.86 for the proposed stereoscopic image metric and 0.89 and 0.91 for the proposed stereoscopic video metrics.
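The correlations quoted above compare a metric's objective scores with subjective ratings. The abstract does not say which correlation coefficient was used, so the following is only a minimal sketch assuming the Pearson linear correlation; the function name and the sample numbers are illustrative, not taken from the paper.

```python
import math


def pearson(objective, subjective):
    """Pearson linear correlation between two equally long score lists."""
    n = len(objective)
    if n != len(subjective) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_o = sum(objective) / n
    mean_s = sum(subjective) / n
    # Covariance and standard deviations, all without the 1/n factor,
    # which cancels in the ratio below.
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(objective, subjective))
    dev_o = math.sqrt(sum((o - mean_o) ** 2 for o in objective))
    dev_s = math.sqrt(sum((s - mean_s) ** 2 for s in subjective))
    if dev_o == 0 or dev_s == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (dev_o * dev_s)


# Illustrative values only: metric outputs vs. mean opinion scores.
metric_scores = [0.2, 0.4, 0.5, 0.9]
opinion_scores = [1.5, 2.5, 3.0, 4.8]
print(round(pearson(metric_scores, opinion_scores), 3))
```

A value close to 1 means the metric ranks and spaces the stimuli much as human viewers do; rank-based coefficients such as Spearman's are a common alternative when only the ordering matters.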
GazeStereo3D: seamless disparity manipulations
Producing a high-quality stereoscopic impression on current displays is a challenging task. The content has to be carefully prepared in order to maintain visual comfort, which typically affects the quality of depth reproduction. In this work, we show that this problem can be significantly alleviated when the eye fixation regions can be roughly estimated. We propose a new method for stereoscopic depth adjustment that utilizes eye tracking or other gaze prediction information. The key idea that distinguishes our approach from previous work is to apply gradual depth adjustments at the eye fixation stage, so that they remain unnoticeable. To this end, we measure the limits imposed on the speed of disparity changes in various depth adjustment scenarios, and formulate a new model that can guide such seamless stereoscopic content processing. Based on this model, we propose a real-time controller that applies local manipulations to stereoscopic content to find the best trade-off between depth reproduction and visual comfort. We show that the controller is mostly immune to the limitations of low-cost eye tracking solutions. We also demonstrate benefits of our model in off-line applications, such as stereoscopic movie production, where skillful directors can reliably guide and predict viewers' attention or where attended image regions are identified during eye tracking sessions. We validate both our model and the controller in a series of user experiments. They show significant improvements in depth perception without sacrificing the visual quality when our techniques are applied.
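The idea of gradual, speed-limited disparity changes can be illustrated with a simple rate limiter. This is a sketch under stated assumptions, not the controller from the paper: the abstract gives neither the measured speed limits nor the model, so `max_speed`, the frame interval `dt`, and both function names are hypothetical.

```python
def step_disparity(current, target, max_speed, dt):
    """Move `current` toward `target`, changing by at most max_speed * dt."""
    max_step = max_speed * dt
    delta = target - current
    # Clamp the per-frame change so it never exceeds the assumed speed limit.
    delta = max(-max_step, min(max_step, delta))
    return current + delta


def adjust_over_time(start, target, max_speed, dt, n_frames):
    """Return the disparity value at each frame, starting with `start`."""
    values = [start]
    for _ in range(n_frames):
        values.append(step_disparity(values[-1], target, max_speed, dt))
    return values


# Hypothetical units: disparity in degrees, speed in degrees per second.
print(adjust_over_time(0.0, 1.0, max_speed=2.0, dt=0.25, n_frames=3))
# → [0.0, 0.5, 1.0, 1.0]
```

A constant speed limit is the simplest possible choice; the abstract indicates that the actual limits depend on the depth adjustment scenario.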
Combining Features and Semantics for Low-level Computer Vision
Visual perception of depth and motion plays a significant role in understanding and navigating the environment.
Reconstructing outdoor scenes in 3D and estimating the motion from video cameras are of utmost importance for applications like autonomous driving.
The corresponding problems in computer vision have witnessed tremendous progress over the last decades, yet some aspects still remain challenging today. Striking examples are reflecting and textureless surfaces or large motions which cannot be easily recovered using traditional local methods. Further challenges include occlusions, large distortions and difficult lighting conditions. In this thesis, we propose to overcome these challenges by modeling non-local interactions leveraging semantics and contextual information.
Firstly, for binocular stereo estimation, we propose to regularize over larger areas on the image using object-category specific disparity proposals which we sample using inverse graphics techniques based on a sparse disparity estimate and a semantic segmentation of the image. The disparity proposals encode the fact that objects of certain categories are not arbitrarily shaped but typically exhibit regular structures. We integrate them as non-local regularizer for the challenging object class 'car' into a superpixel-based graphical model and demonstrate its benefits especially in reflective regions.
Secondly, for 3D reconstruction, we leverage the fact that the larger the reconstructed area, the more likely objects of similar type and shape will occur in the scene. This is particularly true for outdoor scenes where buildings and vehicles often suffer from missing texture or reflections, but share similarity in 3D shape. We take advantage of this shape similarity by localizing objects using detectors and jointly reconstructing them while learning a volumetric model of their shape. This allows us to reduce noise while completing missing surfaces, as objects of similar shape benefit from all observations for the respective category. Evaluations with respect to LIDAR ground-truth on a novel challenging suburban dataset show the advantages of modeling structural dependencies between objects.
Finally, motivated by the success of deep learning techniques in matching problems, we present a method for learning context-aware features for solving optical flow using discrete optimization. Towards this goal, we present an efficient way of training a context network with a large receptive field size on top of a local network using dilated convolutions on patches. We perform feature matching by comparing each pixel in the reference image to every pixel in the target image, utilizing fast GPU matrix multiplication. The matching cost volume from the network's output forms the data term for discrete MAP inference in a pairwise Markov random field. Extensive evaluations reveal the importance of context for feature matching.
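The all-pairs feature matching described in the abstract above, where every reference pixel is compared with every target pixel through one matrix multiplication, can be sketched as follows. This is an illustration rather than the thesis implementation: it runs on the CPU with NumPy instead of a GPU, uses random arrays in place of learned features, and assumes cosine similarity as the comparison; the function name is hypothetical.

```python
import numpy as np


def all_pairs_cost(feat_ref, feat_tgt, eps=1e-12):
    """Matching cost between every reference pixel and every target pixel.

    feat_ref, feat_tgt: arrays of shape (H, W, D) holding one D-dimensional
    feature vector per pixel. Returns an (H*W, H*W) array where entry [i, j]
    is 1 - cosine similarity of reference pixel i and target pixel j.
    """
    d = feat_ref.shape[-1]
    ref = feat_ref.reshape(-1, d)
    tgt = feat_tgt.reshape(-1, d)
    # Normalise each feature vector so the dot product is a cosine similarity.
    ref = ref / np.maximum(np.linalg.norm(ref, axis=1, keepdims=True), eps)
    tgt = tgt / np.maximum(np.linalg.norm(tgt, axis=1, keepdims=True), eps)
    # A single matrix multiplication compares all pixel pairs at once.
    similarity = ref @ tgt.T
    return 1.0 - similarity


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 5, 8))   # stand-in for network output
cost = all_pairs_cost(features, features)
print(cost.shape)                        # (20, 20)
```

The cost volume grows quadratically with the number of pixels, which is why such all-pairs comparisons are usually done at reduced resolution or on hardware with fast matrix multiplication. In the thesis the resulting costs serve as the data term for MAP inference in a pairwise Markov random field, which this sketch does not cover.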
Object segmentation in depth maps with one user click and a synthetically trained fully convolutional network
With more and more household objects built on planned obsolescence and
consumed by a fast-growing population, hazardous waste recycling has become a
critical challenge. Given the large variability of household waste, current
recycling platforms mostly rely on human operators to analyze the scene,
typically composed of many object instances piled up in bulk. Helping them by
robotizing the unitary extraction is a key challenge to speed up this tedious
process. Whereas supervised deep learning has proven very efficient for such
object-level scene understanding, e.g., generic object detection and
segmentation in everyday scenes, it requires large sets of per-pixel
labeled images, which are hardly available in numerous application contexts,
including industrial robotics. We thus propose a step towards a practical
interactive application for generating an object-oriented robotic grasp,
requiring as inputs only one depth map of the scene and one user click on the
next object to extract. More precisely, we address in this paper the intermediate
problem of object segmentation in top views of piles of bulk objects given a
pixel location, called the seed, provided interactively by a human operator. We
propose a twofold framework for generating edge-driven instance segments.
First, we repurpose a state-of-the-art fully convolutional object contour
detector for seed-based instance segmentation by introducing the notion of
edge-mask duality with a novel patch-free and contour-oriented loss function.
Second, we train one model using only synthetic scenes, instead of manually
labeled training data. Our experimental results show that considering edge-mask
duality for training an encoder-decoder network, as we suggest, outperforms a
state-of-the-art patch-based network in the present application context.
Comment: This is a pre-print of an article published in Human Friendly
Robotics, 10th International Workshop, Springer Proceedings in Advanced
Robotics, vol 7 (Siciliano Bruno, Khatib Oussama; in press). The final
authenticated version is available online at:
https://doi.org/10.1007/978-3-319-89327-3_16
A method for evaluating the quality of stereoscopic 3D images
In a context of ever-growing interest in stereoscopic systems, but where no standardized algorithmic methods of stereoscopic quality assessment exist, our work is a step forward in understanding the human perception and judgment mechanisms related to the multidimensional concept of stereoscopic image quality. We used a series of tools to carry out in-depth investigations in this direction: we proposed an adapted framework to structure the process of stereoscopic quality assessment, we implemented a stereoscopic system in our laboratory for performing various tests, we created three stereoscopic datasets with precise structures, and we performed several experimental studies using these datasets. The large body of experimental data obtained was used to propose a first mathematical framework for explaining the overall percept of stereoscopic quality as a function of the physical parameters of the stereoscopic images under study.
"What Not" Detectors Help the Brain See in Depth
Binocular stereopsis is one of the primary cues for three-dimensional (3D) vision in species ranging from insects to primates. Understanding how the brain extracts depth from two different retinal images represents a tractable challenge in sensory neuroscience that has so far evaded full explanation. Central to current thinking is the idea that the brain needs to identify matching features in the two retinal images (i.e., solving the "stereoscopic correspondence problem") so that the depth of objects in the world can be triangulated. Although intuitive, this approach fails to account for key physiological and perceptual observations. We show that formulating the problem to identify "correct matches" is suboptimal and propose an alternative, based on optimal information encoding, that mixes disparity detection with "proscription": exploiting dissimilar features to provide evidence against unlikely interpretations. We demonstrate the role of these "what not" responses in a neural network optimized to extract depth in natural images. The network combines information for and against the likely depth structure of the viewed scene, naturally reproducing key characteristics of both neural responses and perceptual interpretations. We capture the encoding and readout computations of the network in simple analytical form and derive a binocular likelihood model that provides a unified account of long-standing puzzles in 3D vision at the physiological and perceptual levels. We suggest that marrying detection with proscription provides an effective coding strategy for sensory estimation that may be useful for diverse feature domains (e.g., motion) and multisensory integration.
Funding: Wellcome Trust (095183/Z/10/Z)
Event-based Vision: A Survey
Event cameras are bio-inspired sensors that differ from conventional frame
cameras: Instead of capturing images at a fixed rate, they asynchronously
measure per-pixel brightness changes, and output a stream of events that encode
the time, location and sign of the brightness changes. Event cameras offer
attractive properties compared to traditional cameras: high temporal resolution
(on the order of microseconds), very high dynamic range (140 dB vs. 60 dB), low
power consumption, and high pixel bandwidth (on the order of kHz) resulting in
reduced motion blur. Hence, event cameras have a large potential for robotics
and computer vision in challenging scenarios for traditional cameras, such as
low-latency, high speed, and high dynamic range. However, novel methods are
required to process the unconventional output of these sensors in order to
unlock their potential. This paper provides a comprehensive overview of the
emerging field of event-based vision, with a focus on the applications and the
algorithms developed to unlock the outstanding properties of event cameras. We
present event cameras from their working principle, the actual sensors that are
available and the tasks that they have been used for, from low-level vision
(feature detection and tracking, optic flow, etc.) to high-level vision
(reconstruction, segmentation, recognition). We also discuss the techniques
developed to process events, including learning-based techniques, as well as
specialized processors for these novel sensors, such as spiking neural
networks. Additionally, we highlight the challenges that remain to be tackled
and the opportunities that lie ahead in the search for a more efficient,
bio-inspired way for machines to perceive and interact with the world.
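To make the event-stream output concrete, the sketch below represents each event by the three quantities the abstract lists (time, location, and sign of the brightness change) and sums the events of a time window into a frame-like grid. The `Event` type, the field names, and the `accumulate` function are assumptions chosen for illustration; real sensors and libraries use their own formats.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    t: float       # timestamp in seconds
    x: int         # pixel column
    y: int         # pixel row
    polarity: int  # +1 for a brightness increase, -1 for a decrease


def accumulate(events: List[Event], width: int, height: int,
               t_start: float, t_end: float) -> List[List[int]]:
    """Sum event polarities per pixel over the window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for ev in events:
        if t_start <= ev.t < t_end:
            frame[ev.y][ev.x] += ev.polarity
    return frame


stream = [
    Event(0.001, 1, 0, +1),
    Event(0.002, 1, 0, +1),
    Event(0.003, 0, 1, -1),
    Event(0.020, 0, 0, +1),   # falls outside the window used below
]
print(accumulate(stream, width=2, height=2, t_start=0.0, t_end=0.01))
# → [[0, 2], [-1, 0]]
```

Accumulating events into frames is only one of several ways to process them and it discards the fine timing that makes these sensors attractive; it is shown here because it is the easiest representation to inspect.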