
    Visual Question Answering: A Survey of Methods and Datasets

    Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models. Comment: 25 page

    Gaze Distribution Analysis and Saliency Prediction Across Age Groups

    Knowledge of the human visual system helps to develop better computational models of visual attention. State-of-the-art models have been developed to mimic the visual attention system of young adults; however, they largely ignore the variations that occur with age. In this paper, we investigated how visual scene processing changes with age and propose an age-adapted framework that helps to develop a computational model that can predict saliency across different age groups. Our analysis uncovers how the explorativeness of an observer varies with age, how well saliency maps of an age group agree with fixation points of observers from the same or different age groups, and how age influences the center bias. We analyzed the eye movement behavior of 82 observers belonging to four age groups while they explored visual scenes. Explorativeness was quantified in terms of the entropy of a saliency map, and area under the curve (AUC) metrics were used to quantify the agreement and the center bias. These results were used to develop age-adapted saliency models. Our results suggest that the proposed age-adapted saliency model outperforms existing saliency models in predicting the regions of interest across age groups.
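    The abstract above quantifies explorativeness as the entropy of a saliency map. A minimal sketch of that idea follows, assuming the map is a 2D grid of non-negative values that is normalized to a probability distribution before computing Shannon entropy. The function name `saliency_entropy` and the toy maps are hypothetical illustrations, not the authors' implementation.

```python
import math


def saliency_entropy(saliency_map):
    """Shannon entropy in bits of a saliency map treated as a distribution.

    Assumption: `saliency_map` is a list of rows of non-negative numbers.
    """
    values = [v for row in saliency_map for v in row]
    total = sum(values)
    if total <= 0:
        raise ValueError("saliency map must contain positive mass")
    entropy = 0.0
    for v in values:
        if v > 0:  # zero-probability cells contribute nothing
            p = v / total
            entropy -= p * math.log2(p)
    return entropy


# All mass on one location: no exploration, entropy is zero.
focused = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
# Mass spread evenly over four locations: entropy is log2(4) = 2 bits.
spread = [[1, 1], [1, 1]]

print(saliency_entropy(focused))  # 0.0
print(saliency_entropy(spread))   # 2.0
```

    Under this reading, a higher entropy means fixations are spread over more of the scene, which is one plausible way to interpret a more explorative observer.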

    Toward a model of computational attention based on expressive behavior: applications to cultural heritage scenarios

    Our project goals were the development of attention-based analysis of human expressive behavior and the implementation of a real-time algorithm in EyesWeb XMI, in order to improve the naturalness of human-computer interaction and context-based monitoring of human behavior. To this aim, a perceptual model that mimics human attentional processes was developed for expressivity analysis and modeled by entropy. Museum scenarios were selected as an ecological test-bed for three experiments that focus on visitor profiling and visitor flow regulation.

    The multisensory body revealed through its cast shadows

    One key issue when conceiving the body as a multisensory object is how the cognitive system integrates visible instances of the self and other bodies with one's own somatosensory processing, to achieve self-recognition and body ownership. Recent research has strongly suggested that shadows cast by our own body have a special status for cognitive processing, directing attention to the body in a fast and highly specific manner. The aim of the present article is to review the most recent scientific contributions addressing how body shadows affect both sensory/perceptual and attentional processes. The review examines three main points: (1) body shadows as a special window to investigate the construction of multisensory body perception; (2) experimental paradigms and related findings; (3) open questions and future trajectories. The reviewed literature suggests that shadows cast by one's own body promote binding between personal and extrapersonal space and elicit automatic orienting of attention toward the body part casting the shadow. Future research should address whether the effects exerted by body shadows are similar to those observed when observers are exposed to other visual instances of their body. The results will further clarify the processes underlying the merging of vision and somatosensation when creating body representations.

    Change blindness: eradication of gestalt strategies

    Arrays of eight texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task where there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al, 2003 Vision Research 43 149–164). Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this we changed the spatial position of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference in performance between this and the standard task [F(1,4)=2.565, p=0.185]. This may suggest two things: (i) Gestalt grouping is not used as a strategy in these tasks, and (ii) objects may be stored in and retrieved from a pre-attentional store during this task.
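    The reported statistic F(1,4)=2.565 can be checked against the stated p=0.185 with a short calculation. The sketch below relies on two standard facts: an F statistic with (1, n) degrees of freedom is the square of a Student t statistic with n degrees of freedom, and the t distribution with 4 degrees of freedom has a closed-form CDF. The function name `p_value_f_1_4` is a hypothetical helper and only covers this one case.

```python
import math


def p_value_f_1_4(f_stat):
    """Upper-tail p-value for an F statistic with (1, 4) degrees of freedom.

    Uses F(1, 4) = t^2 with 4 df and the closed-form Student t CDF for 4 df:
    CDF(t) = 1/2 + (3/8) * a * (1 - b / 12),
    where a = t / sqrt(1 + t^2 / 4) and b = t^2 / (1 + t^2 / 4).
    """
    if f_stat < 0:
        raise ValueError("F statistic must be non-negative")
    t = math.sqrt(f_stat)
    scale = 1.0 + f_stat / 4.0
    a = t / math.sqrt(scale)
    b = f_stat / scale
    cdf = 0.5 + 0.375 * a * (1.0 - b / 12.0)
    # The F test is one-sided in F, which equals a two-sided test in t.
    return 2.0 * (1.0 - cdf)


p = p_value_f_1_4(2.565)
print(f"{p:.4f}")  # approximately 0.1845, consistent with the reported p=0.185
```

    With only 4 error degrees of freedom the test has little power, so the non-significant result is weak evidence against a Gestalt strategy rather than proof of its absence.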

    Attend and Interact: Higher-Order Object Interactions for Video Understanding

    Human actions often involve complex interactions across several inter-related objects in the scene. However, existing approaches to fine-grained video understanding or visual relationship detection often rely on single object representation or pairwise object relationships. Furthermore, learning interactions across multiple objects in hundreds of frames for video is computationally infeasible and performance may suffer since a large combinatorial space has to be modeled. In this paper, we propose to efficiently learn higher-order interactions between arbitrary subgroups of objects for fine-grained video understanding. We demonstrate that modeling object interactions significantly improves accuracy for both action recognition and video captioning, while requiring more than three times less computation than traditional pairwise relationships. The proposed method is validated on two large-scale datasets: Kinetics and ActivityNet Captions. Our SINet and SINet-Caption achieve state-of-the-art performance on both datasets even though the videos are sampled at a maximum of 1 FPS. To the best of our knowledge, this is the first work modeling object interactions on open domain large-scale video datasets, and we additionally model higher-order object interactions, which improves performance at low computational cost. Comment: CVPR 201

    What does the amygdala contribute to social cognition?

    The amygdala has received intense recent attention from neuroscientists investigating its function at the molecular, cellular, systems, cognitive, and clinical levels. It clearly contributes to processing emotionally and socially relevant information, yet a unifying description and computational account have been lacking. The difficulty of tying together the various studies stems in part from the sheer diversity of approaches and species studied, in part from the amygdala's inherent heterogeneity in terms of its component nuclei, and in part because different investigators have simply been interested in different topics. Yet, a synthesis now seems close at hand in combining new results from social neuroscience with data from neuroeconomics and reward learning. The amygdala processes a psychological stimulus dimension related to saliency or relevance; mechanisms have been identified to link it to processing unpredictability; and insights from reward learning have situated it within a network of structures that include the prefrontal cortex and the ventral striatum in processing the current value of stimuli. These aspects help to clarify the amygdala's contributions to recognizing emotion from faces, to social behavior toward conspecifics, and to reward learning and instrumental behavior.