10 research outputs found

    Implementation and analysis of 2D eye-movement (scanpath) prediction models.

    Visual attention encompasses a set of cognitive operations that allow us to build a mental representation of the world around us. Eye movements, in turn, are a quantifiable manifestation of the decision processes that govern visual attention. In recent years, a number of models have been proposed to predict, model, and analyze eye movements over a scene. Two key concepts for understanding visual attention are saliency maps and scanpaths. A saliency map highlights the region or regions of a scene where an observer is most likely to fixate. A scanpath is the specific path an individual's eyes follow while observing a scene. The two concepts are directly related, and it is possible to compute a saliency map from scanpaths or vice versa. In particular, the question of which image features affect the spatial distribution of fixations has been the main focus of a large body of research to date. Saliency maps are directly related to scanpaths, since they can be regarded as their precursor; however, saliency maps give a static, final view of the saliency of an image, whereas scanpaths allow the visual trajectory of a particular observer over a scene to be studied dynamically. Many saliency map generation models exist today, and such maps can be computed by aggregating the static behavior of many users. In contrast, there are far fewer scanpath prediction models, owing to their greater complexity. In this work we focus on scanpath prediction models, since they provide more complete information and can simulate the variable behavior of each individual when viewing a scene. Starting from the state of the art, the best-performing models on natural images are selected, implemented, and analyzed on a test bench designed to study the models' ability to generalize, using a compilation of databases from a hospital study. The metrics used allow us to quantitatively analyze the behavior of the models, taking into account the spatiotemporal characteristics of the generated scanpaths, and show that no single model is best across all the image categories analyzed. Owing to the lack of data beyond free-viewing visual tasks, it is not possible to train a new model that outperforms current models on these new categories. For this reason, we propose a metamodel built around a Deep Learning-based image categorizer, which allows us to always select the best-performing model for each category.
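    The proposed metamodel lends itself to a simple routing scheme: an image categorizer picks a category, and the scanpath model that performed best on that category is used for prediction. The sketch below is only an illustration of that idea, not the thesis code; the categorizer and the per-category scanpath models are hypothetical callables supplied by the user.

```python
from typing import Callable, Dict, List, Tuple

# A scanpath as a sequence of (x, y, duration) fixations.
Scanpath = List[Tuple[float, float, float]]

class ScanpathMetaModel:
    """Routes each image to the scanpath model that performed best on its category."""

    def __init__(self,
                 categorize: Callable[[object], str],
                 best_model_per_category: Dict[str, Callable[[object], Scanpath]]):
        # `categorize` stands in for the Deep Learning image categorizer (hypothetical);
        # `best_model_per_category` maps each image category to the scanpath model
        # that scored best on it in the benchmark.
        self.categorize = categorize
        self.best_model_per_category = best_model_per_category

    def predict(self, image) -> Scanpath:
        category = self.categorize(image)               # e.g. "natural", "medical", ...
        model = self.best_model_per_category[category]  # best-performing model for that category
        return model(image)                             # delegate scanpath generation
```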

    UEyes: Understanding Visual Saliency across User Interface Types

    Funding Information: This work was supported by Aalto University’s Department of Information and Communications Engineering, the Finnish Center for Artificial Intelligence (FCAI), the Academy of Finland through the projects Human Automata (grant 328813) and BAD (grant 318559), the Horizon 2020 FET program of the European Union (grant CHISTERA-20-BCI-001), and the European Innovation Council Pathfinder program (SYMBIOTIK project, grant 101071147). We appreciate Chuhan Jiao’s initial implementation of the baseline methods for saliency prediction and active discussion with Yao (Marc) Wang. Publisher Copyright: © 2023 Owner/Author. While user interfaces (UIs) display elements such as images and text in a grid-based layout, UI types differ significantly in the number of elements and how they are displayed. For example, webpage designs rely heavily on images and text, whereas desktop UIs tend to feature numerous small images. To examine how such differences affect the way users look at UIs, we collected and analyzed a large eye-tracking-based dataset, UEyes (62 participants and 1,980 UI screenshots), covering four major UI types: webpage, desktop UI, mobile UI, and poster. We analyze differences across UI types in biases related to such factors as color, location, and gaze direction. We also compare state-of-the-art predictive models and propose improvements for better capturing typical tendencies across UI types. Both the dataset and the models are publicly available. Peer reviewed.
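    As an illustration of the kind of location bias analyzed here, the sketch below estimates an empirical location-bias (fixation-density) map for one UI type by pooling fixations and smoothing them with a Gaussian. It is not the paper's analysis code; the fixation format and the smoothing bandwidth are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def location_bias_map(fixations, height, width, sigma=30.0):
    """Pool (x, y) fixation coordinates from one UI type into a normalized
    location-bias heatmap by counting fixations per pixel and Gaussian-smoothing."""
    counts = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            counts[yi, xi] += 1.0
    density = gaussian_filter(counts, sigma=sigma)    # smooth discrete fixations into a density
    total = density.sum()
    return density / total if total > 0 else density  # normalize to sum to 1
```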

    Predicting Visual Attention and Distraction During Visual Search Using Convolutional Neural Networks

    Most studies in computational modeling of visual attention consider task-free observation of images, yet free-viewing saliency covers only a limited set of daily-life scenarios: most visual activities are goal-oriented and demand a great amount of top-down attention control, and visual search in particular demands more top-down control than free viewing. In this paper, we present two approaches to model visual attention and distraction of observers during visual search. Our first approach adapts a lightweight free-viewing saliency model to predict eye fixation density maps of human observers over the pixels of search images, using a two-stream convolutional encoder-decoder network trained and evaluated on the COCO-Search18 dataset. This method predicts which locations are more distracting when searching for a particular target. Our network achieves good results on standard saliency metrics (AUC-Judd=0.95, AUC-Borji=0.85, sAUC=0.84, NSS=4.64, KLD=0.93, CC=0.72, SIM=0.54, and IG=2.59). Our second approach is object-based and predicts the distractor and target objects during visual search. Distractors are all objects except the target that observers fixate on during search. This method uses a Mask-RCNN segmentation network pre-trained on MS-COCO and fine-tuned on COCO-Search18. We release our segmentation annotations of targets and distractors in COCO-Search18 for three target categories: bottle, bowl, and car. The average scores over the three categories are: F1-score=0.64, MAP(IoU=0.5)=0.57, MAR(IoU=0.5)=0.73. Our implementation code in TensorFlow is publicly available at https://github.com/ManooshSamiei/Distraction-Visual-Search. Comment: 33 pages, 24 figures, 12 tables; this is a pre-print manuscript currently under review in Journal of Vision.
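    The saliency metrics reported above have standard definitions; the sketch below gives common NumPy implementations of three of them (NSS, CC, and KLD), which may differ in details such as epsilon handling from the evaluation code used in the paper.

```python
import numpy as np

def nss(saliency_map, fixation_map):
    """Normalized Scanpath Saliency: mean z-scored saliency value at fixated pixels."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return s[fixation_map.astype(bool)].mean()

def cc(saliency_map, density_map):
    """Pearson correlation coefficient between predicted and ground-truth density maps."""
    a = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    b = (density_map - density_map.mean()) / (density_map.std() + 1e-8)
    return (a * b).mean()

def kld(saliency_map, density_map, eps=1e-8):
    """KL divergence of the predicted distribution from the ground-truth density."""
    p = density_map / (density_map.sum() + eps)    # ground truth as a probability map
    q = saliency_map / (saliency_map.sum() + eps)  # prediction as a probability map
    return float((p * np.log(eps + p / (q + eps))).sum())
```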

    Visual and linguistic processes in deep neural networks: A cognitive perspective

    When people describe an image, there are complex visual and linguistic processes at work. For instance, speakers tend to look at an object right before mentioning it, but not every time. Similarly, during a conversation, speakers can refer to an entity multiple times, using expressions evolving in the common ground. In this thesis, I develop computational models of such visual and linguistic processes, drawing inspiration from theories and findings from cognitive science and psycholinguistics. This work, where I aim to capture the intricate relationship between non-linguistic modalities and language within deep artificial neural networks, contributes to the line of research into multimodal Natural Language Processing. This thesis consists of two parts: (1) modeling human gaze in language use (production and comprehension), and (2) modeling communication strategies in referential tasks in visually grounded dialogue. In the first part, I delve into enhancing image description generation models using eye-tracking data; evaluating the variation in human signals while describing images; and predicting human reading behavior in the form of eye movements. In the second part, I build models quantifying, generating, resolving, and adapting utterances in referential tasks situated within visual and conversational contexts. The outcomes advance our understanding of human visuo-linguistic processes by revealing intricate strategies at play in such processes, and point to the importance of accounting for them when developing and utilizing multimodal models. The findings shed light on how advancements in artificial intelligence could contribute to advancing research on crossmodal processes in humans, and vice versa.

    Hierarchical representations for spatio-temporal visual attention: modeling and understanding

    International Mention in the doctoral degree. Within the framework of Artificial Intelligence, Computer Vision is a scientific discipline that aims to automatically simulate the functions of the human visual system, addressing tasks such as object localization and recognition, event detection, and object tracking... Programa Oficial de Doctorado en Multimedia y Comunicaciones. Thesis committee: Luis Salgado Álvarez de Sotomayor (Chair), Ascensión Gallardo Antolín (Secretary), Jenny Benois-Pineau (Member).

    Face Image and Video Analysis in Biometrics and Health Applications

    Computer Vision (CV) enables computers and systems to derive meaningful information from acquired visual inputs, such as images and videos, and to make decisions based on the extracted information. Its goal is to acquire, process, analyze, and understand this information by developing theoretical and algorithmic models. Biometrics are distinctive and measurable human characteristics used to label or describe individuals, combining computer vision with knowledge of human physiology (e.g., face, iris, fingerprint) and behavior (e.g., gait, gaze, voice). The face is one of the most informative biometric traits, and many studies have investigated the human face from the perspectives of disciplines ranging from computer vision and deep learning to neuroscience and biometrics. In this work, we analyze face characteristics from digital images and videos in the areas of morphing attack and defense, and autism diagnosis. For face morphing attack generation, we propose a transformer-based generative adversarial network that produces more visually realistic morphing attacks by combining different losses: face matching distance, a facial-landmark-based loss, a perceptual loss, and pixel-wise mean squared error. In the face morphing attack detection study, we design a fusion-based few-shot learning (FSL) method to learn discriminative features from face images for few-shot morphing attack detection (FS-MAD), and extend the current binary detection into multiclass classification, namely few-shot morphing attack fingerprinting (FS-MAF). In the autism diagnosis study, we develop a discriminative few-shot learning method to analyze hour-long video data and explore the fusion of facial dynamics for facial trait classification of autism spectrum disorder (ASD) at three severity levels. The results show outstanding performance of the proposed fusion-based few-shot framework on the dataset. In addition, we further explore the possibility of performing facial micro-expression spotting and feature analysis on autism video data to classify ASD and control groups; the results indicate the effectiveness of subtle facial expression changes for autism diagnosis.
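    The morph-generation objective is described as a combination of several loss terms. The sketch below illustrates one plausible way such terms could be combined in PyTorch; the embedding, landmark, and deep-feature extractors are hypothetical callables, the loss weights are placeholders, and using the average of the two contributing faces as the regression target is an assumption, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_morph_loss(morph, face_a, face_b,
                        face_embed, landmarks, deep_features,
                        w_id=1.0, w_lmk=1.0, w_perc=1.0, w_pix=1.0):
    """Weighted sum of the loss terms named in the abstract (hypothetical sketch).
    `face_embed`, `landmarks`, and `deep_features` are assumed pretrained extractors."""
    # Face matching distance: the morph should stay close to both contributing identities.
    id_loss = ((1 - F.cosine_similarity(face_embed(morph), face_embed(face_a), dim=-1)).mean()
               + (1 - F.cosine_similarity(face_embed(morph), face_embed(face_b), dim=-1)).mean())
    # Facial-landmark-based loss: keep geometry near the average landmark layout.
    lmk_loss = F.mse_loss(landmarks(morph), 0.5 * (landmarks(face_a) + landmarks(face_b)))
    # Perceptual loss: match deep features of the two contributing faces.
    perc_loss = F.mse_loss(deep_features(morph),
                           0.5 * (deep_features(face_a) + deep_features(face_b)))
    # Pixel-wise mean squared error against the simple pixel average.
    pix_loss = F.mse_loss(morph, 0.5 * (face_a + face_b))
    return w_id * id_loss + w_lmk * lmk_loss + w_perc * perc_loss + w_pix * pix_loss
```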

    Gaze-Based Human-Robot Interaction by the Brunswick Model

    We present a new paradigm for human-robot interaction based on social signal processing, and in particular on the Brunswick model. Originally, the Brunswick model deals with face-to-face dyadic interaction, assuming that the interactants communicate through a continuous exchange of non-verbal social signals in addition to the spoken messages. Social signals have to be interpreted through a proper recognition phase that considers visual and audio information. The Brunswick model makes it possible to quantitatively evaluate the quality of the interaction using statistical tools that measure how effective the recognition phase is. In this paper we cast this theory in a setting where one of the interactants is a robot; in this case, the recognition phases performed by the robot and by the human have to be revised with respect to the original model. The model is applied to Berrick, a recent open-source, low-cost robotic head platform, where gaze is the social signal considered.
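    One common way to quantify how effective a recognition phase is, in lens-model-style analyses, is to correlate what the sender actually expressed with what the receiver attributed across interaction episodes. The sketch below is a hypothetical simplification along those lines, not the statistical tooling used in the paper.

```python
import numpy as np

def recognition_effectiveness(expressed, attributed):
    """Correlation between the signal the human actually expressed (e.g., a coded gaze
    target per episode) and the signal the robot attributed, across episodes.
    A hypothetical, lens-model-style proxy for recognition effectiveness."""
    expressed = np.asarray(expressed, dtype=float)
    attributed = np.asarray(attributed, dtype=float)
    return float(np.corrcoef(expressed, attributed)[0, 1])
```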