159,628 research outputs found
Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datatsets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.Comment: 25 page
Internal representations, external representations and ergonomics: towards a theoretical integration
Gaze Distribution Analysis and Saliency Prediction Across Age Groups
Knowledge of the human visual system helps to develop better computational
models of visual attention. State-of-the-art models have been developed to
mimic the visual attention system of young adults that, however, largely ignore
the variations that occur with age. In this paper, we investigated how visual
scene processing changes with age and we propose an age-adapted framework that
helps to develop a computational model that can predict saliency across
different age groups. Our analysis uncovers how the explorativeness of an
observer varies with age, how well saliency maps of an age group agree with
fixation points of observers from the same or different age groups, and how age
influences the center bias. We analyzed the eye movement behavior of 82
observers belonging to four age groups while they explored visual scenes.
Explorativeness was quantified in terms of the entropy of a saliency map, and
area under the curve (AUC) metrics was used to quantify the agreement analysis
and the center bias. These results were used to develop age adapted saliency
models. Our results suggest that the proposed age-adapted saliency model
outperforms existing saliency models in predicting the regions of interest
across age groups
Toward a model of computational attention based on expressive behavior: applications to cultural heritage scenarios
Our project goals consisted in the development of attention-based analysis of human expressive behavior and the implementation of real-time algorithm in EyesWeb XMI in order to improve naturalness of human-computer interaction and context-based monitoring of human behavior. To this aim, perceptual-model that mimic human attentional processes was developed for expressivity analysis and modeled by entropy. Museum scenarios were selected as an ecological test-bed to elaborate three experiments that focus on visitor profiling and visitors flow regulation
The multisensory body revealed through its cast shadows
One key issue when conceiving the body as a multisensory object is how the cognitive
system integrates visible instances of the self and other bodies with one\u2019s own
somatosensory processing, to achieve self-recognition and body ownership. Recent
research has strongly suggested that shadows cast by our own body have a special
status for cognitive processing, directing attention to the body in a fast and highly specific
manner. The aim of the present article is to review the most recent scientific contributions
addressing how body shadows affect both sensory/perceptual and attentional processes.
The review examines three main points: (1) body shadows as a special window to
investigate the construction of multisensory body perception; (2) experimental paradigms
and related findings; (3) open questions and future trajectories. The reviewed literature
suggests that shadows cast by one\u2019s own body promote binding between personal
and extrapersonal space and elicit automatic orienting of attention toward the bodypart
casting the shadow. Future research should address whether the effects exerted
by body shadows are similar to those observed when observers are exposed to other
visual instances of their body. The results will further clarify the processes underlying the
merging of vision and somatosensation when creating body representations
Change blindness: eradication of gestalt strategies
Arrays of eight, texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task where there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al, 2003 Vision Research 43149–164]. Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this we changed the spatial position of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference seen in performance between this and the standard task [F(1,4)=2.565, p=0.185]. This may suggest two things: (i) Gestalt grouping is not used as a strategy in these tasks, and (ii) it gives further weight to the argument that objects may be stored and retrieved from a pre-attentional store during this task
Attend and Interact: Higher-Order Object Interactions for Video Understanding
Human actions often involve complex interactions across several inter-related
objects in the scene. However, existing approaches to fine-grained video
understanding or visual relationship detection often rely on single object
representation or pairwise object relationships. Furthermore, learning
interactions across multiple objects in hundreds of frames for video is
computationally infeasible and performance may suffer since a large
combinatorial space has to be modeled. In this paper, we propose to efficiently
learn higher-order interactions between arbitrary subgroups of objects for
fine-grained video understanding. We demonstrate that modeling object
interactions significantly improves accuracy for both action recognition and
video captioning, while saving more than 3-times the computation over
traditional pairwise relationships. The proposed method is validated on two
large-scale datasets: Kinetics and ActivityNet Captions. Our SINet and
SINet-Caption achieve state-of-the-art performances on both datasets even
though the videos are sampled at a maximum of 1 FPS. To the best of our
knowledge, this is the first work modeling object interactions on open domain
large-scale video datasets, and we additionally model higher-order object
interactions which improves the performance with low computational costs.Comment: CVPR 201
What does the amygdala contribute to social cognition?
The amygdala has received intense recent attention from neuroscientists investigating its function at the molecular, cellular, systems, cognitive, and clinical level. It clearly contributes to processing emotionally and socially relevant information, yet a unifying description and computational account have been lacking. The difficulty of tying together the various studies stems in part from the sheer diversity of approaches and species studied, in part from the amygdala's inherent heterogeneity in terms of its component nuclei, and in part because different investigators have simply been interested in different topics. Yet, a synthesis now seems close at hand in combining new results from social neuroscience with data from neuroeconomics and reward learning. The amygdala processes a psychological stimulus dimension related to saliency or relevance; mechanisms have been identified to link it to processing unpredictability; and insights from reward learning have situated it within a network of structures that include the prefrontal cortex and the ventral striatum in processing the current value of stimuli. These aspects help to clarify the amygdala's contributions to recognizing emotion from faces, to social behavior toward conspecifics, and to reward learning and instrumental behavior
- …