
    Modeling human visual behavior in dynamic 360º environments.

    Virtual reality (VR) is rapidly growing: advances in hardware, together with the high computational power now available, are driving this technology, which has the potential to change the way people consume content and has been predicted to become the next big computing paradigm. However, although it has become accessible at a consumer level, much still remains unknown about the grammar and visual language of this medium. Understanding and predicting how humans behave in virtual environments remains an open problem, since the visual behavior known for traditional screen-based content does not hold for immersive VR environments: in VR, the user has total control of the camera, and therefore content creators cannot ensure where viewers' attention will be directed. This understanding of visual behavior, however, can be crucial in many applications, such as novel compression and rendering techniques, content design, or virtual tourism, among others.

    Some works have been devoted to analyzing and modeling human visual behavior. Most of them have focused on identifying the regions of the content that attract observers' visual attention, relying on saliency as a topological measure of which parts of a virtual scene might be of greater interest. When consuming virtual reality content, which can be either static (i.e., 360° images) or dynamic (i.e., 360° videos), many factors affect human visual behavior. They are mainly associated with the scene shown in the VR video or image (e.g., colors, shapes, movement), but also depend on the subjects observing it (their mood and background, the task being performed, previous knowledge, etc.). All these variables affecting saliency make its prediction a challenging task.

    This master thesis presents a novel saliency prediction model for VR videos based on a deep learning (DL) approach. DL networks have shown outstanding results in image processing tasks, automatically inferring the most relevant information from images. The proposed model is the first to exploit the joint potential of convolutional (CNN) and recurrent (RNN) neural networks to extract and model the inherent spatio-temporal features of videos, employing RNNs to account for temporal information at the time of feature extraction, rather than to post-process spatial features as in previous works. It is also tailored to the particularities of dynamic VR videos, using spherical convolutions and a novel spherical loss function for saliency prediction that operate in 3D space rather than in traditional image space. To facilitate spatio-temporal learning, this work is also the first to include the optical flow between 360° frames for saliency prediction, since movement is known to be a highly salient feature in dynamic content. The proposed model was evaluated qualitatively and quantitatively, and shown to outperform state-of-the-art works. Moreover, an exhaustive ablation study demonstrates the effectiveness of the different design decisions made throughout the development of the model.
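
    The abstract mentions a spherical loss function that operates on the sphere rather than in traditional image space, without giving its exact form. The sketch below shows one common way to build such a loss: weighting a per-pixel KL divergence by the solid angle of each equirectangular pixel. All names and details are illustrative assumptions, not the thesis's actual implementation.

        # Minimal sketch (assumed, not the thesis's code): a solid-angle-weighted
        # KL-divergence loss for saliency maps stored as equirectangular (360°) images.
        # Pixels near the poles cover a smaller solid angle than pixels at the equator,
        # so each row is weighted by cos(latitude) before comparing the distributions.
        import math
        import torch

        def spherical_kl_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
            """KL divergence between predicted and ground-truth saliency maps,
            weighted by the solid angle of each equirectangular pixel.

            pred, target: non-negative tensors of shape (B, 1, H, W).
            """
            B, _, H, W = pred.shape
            # Latitude of each pixel row, from +pi/2 (top) to -pi/2 (bottom).
            lat = torch.linspace(math.pi / 2, -math.pi / 2, H, device=pred.device)
            weight = torch.cos(lat).clamp(min=0).view(1, 1, H, 1)  # per-row solid-angle weight

            # Normalize both maps to probability distributions over the weighted sphere.
            p = pred * weight
            q = target * weight
            p = p / (p.sum(dim=(2, 3), keepdim=True) + eps)
            q = q / (q.sum(dim=(2, 3), keepdim=True) + eps)

            kl = (q * torch.log((q + eps) / (p + eps))).sum(dim=(2, 3))
            return kl.mean()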

    Predicción de saliencia en videos 360º mediante aprendizaje profundo (Saliency prediction in 360° videos using deep learning).

    The development of virtual reality technologies is bringing numerous advances to a wide range of industries, such as entertainment, professional training, and medicine. However, how to design and present experiences that are engaging, immersive, and comfortable for the user remains one of the main challenges of virtual reality, so there is a need to study how users perceive and explore these virtual environments. To model users' visual behavior, research has traditionally relied on studying and analyzing the regions that tend to draw users' attention, known as salient regions. The field of saliency prediction studies and predicts the attention of the human visual system by modeling the probability of receiving eye fixations given the visual stimuli presented. When trying to predict the saliency of virtual reality content, saliency prediction models designed for traditional screens do not transfer well to virtual reality headsets, since users only see a limited region of the full content and can choose to look in specific directions. Similarly, saliency prediction models for static images do not transfer well to videos, since information contained in previous frames can affect the saliency of later frames, as happens when tracking moving objects. For this reason, the task of saliency prediction in immersive 360° videos requires specific models adapted to their viewing conditions.

    Over the course of this project, a neural network model based on current deep learning techniques has been implemented to address saliency prediction in 360° videos, paying special attention to the temporal dimension of the videos, which appears to play a fundamental role in human visual attention. Comparing the developed model with current state-of-the-art models, it obtained better results on all the metrics employed, while also showing behavior similar to that observed in real viewers, which reflects the ability of the proposed model to imitate human visual behavior.
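
    The abstract reports improvements on all the metrics employed but does not list them. The sketch below shows three metrics commonly used to evaluate saliency prediction (Pearson correlation coefficient, KL divergence, and normalized scanpath saliency); they are given as representative examples only, not as the exact metrics used in the thesis.

        # Representative saliency-evaluation metrics (assumed examples, not the
        # thesis's exact evaluation code).
        import numpy as np

        def cc(pred, gt, eps=1e-8):
            """Pearson correlation coefficient between two saliency maps."""
            p = (pred - pred.mean()) / (pred.std() + eps)
            g = (gt - gt.mean()) / (gt.std() + eps)
            return float((p * g).mean())

        def kl_div(pred, gt, eps=1e-8):
            """KL divergence, treating both maps as probability distributions."""
            p = pred / (pred.sum() + eps)
            g = gt / (gt.sum() + eps)
            return float((g * np.log(g / (p + eps) + eps)).sum())

        def nss(pred, fixation_mask, eps=1e-8):
            """Normalized scanpath saliency: mean normalized value at fixated pixels."""
            p = (pred - pred.mean()) / (pred.std() + eps)
            return float(p[fixation_mask.astype(bool)].mean())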

    D-SAV360: A Dataset of Gaze Scanpaths on 360° Ambisonic Videos

    Understanding human visual behavior within virtual reality environments is crucial to fully leverage their potential. While previous research has provided rich visual data from human observers, existing gaze datasets often lack multimodal stimuli. Moreover, no dataset has yet gathered eye gaze trajectories (i.e., scanpaths) for dynamic content with directional ambisonic sound, a critical aspect of human sound perception. To address this gap, we introduce D-SAV360, a dataset of 4,609 head and eye scanpaths for 360° videos with first-order ambisonics. This dataset enables a more comprehensive study of the effect of multimodal stimuli on visual behavior in virtual reality environments. We analyze the scanpaths collected from a total of 87 participants viewing 85 different videos and show that factors such as viewing mode, content type, and gender significantly impact eye movement statistics. We demonstrate the potential of D-SAV360 as a benchmarking resource for state-of-the-art attention prediction models and discuss its possible applications in further research. By providing a comprehensive dataset of eye movement data for dynamic, multimodal virtual environments, our work can facilitate future investigations of visual behavior and attention in virtual reality.
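
    As an illustration of the kind of eye-movement statistic such a dataset enables, the sketch below computes the mean angular gaze shift between consecutive samples of a scanpath on the sphere. The column layout assumed here (timestamp, longitude, latitude in degrees) is hypothetical and not D-SAV360's documented format.

        # Hedged sketch: mean great-circle gaze shift from a scanpath stored as an
        # (N, 3) array of (timestamp_s, lon_deg, lat_deg). The layout is an
        # illustrative assumption, not the dataset's actual schema.
        import numpy as np

        def great_circle_deg(lon1, lat1, lon2, lat2):
            """Angular distance in degrees between two gaze directions on the sphere."""
            lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
            d = np.arccos(
                np.clip(
                    np.sin(lat1) * np.sin(lat2)
                    + np.cos(lat1) * np.cos(lat2) * np.cos(lon2 - lon1),
                    -1.0, 1.0,
                )
            )
            return np.degrees(d)

        def mean_gaze_shift(scanpath: np.ndarray) -> float:
            """Mean angular shift between consecutive gaze samples of one scanpath."""
            lon, lat = scanpath[:, 1], scanpath[:, 2]
            shifts = great_circle_deg(lon[:-1], lat[:-1], lon[1:], lat[1:])
            return float(shifts.mean())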