Going Deeper with Semantics: Video Activity Interpretation using Semantic Contextualization
A deeper understanding of video activities extends beyond recognition of
underlying concepts such as actions and objects: constructing deep semantic
representations requires reasoning about the semantic relationships among these
concepts, often beyond what is directly observed in the data. To this end, we
propose an energy minimization framework that leverages large-scale commonsense
knowledge bases, such as ConceptNet, to provide contextual cues to establish
semantic relationships among entities directly hypothesized from video signal.
We mathematically express this using the language of Grenander's canonical
pattern generator theory. We show that the use of prior encoded commonsense
knowledge alleviates the need for large annotated training datasets and helps
tackle imbalance in the training data. Using three different
publicly available datasets (Charades, the Microsoft Visual Description Corpus, and
Breakfast Actions), we show that the proposed model can generate video
interpretations whose quality is better than those reported by state-of-the-art
approaches, which have substantial training needs. Through extensive
experiments, we show that the use of commonsense knowledge from ConceptNet
allows the proposed approach to handle various challenges such as training data
imbalance, weak features, and complex semantic relationships and visual scenes.
Comment: Accepted to WACV 201
Differential recruitment of brain networks following route and cartographic map learning of spatial environments.
An extensive neuroimaging literature has helped characterize the brain regions involved in navigating a spatial environment. Far less is known, however, about the brain networks involved when learning a spatial layout from a cartographic map. To compare the two means of acquiring a spatial representation, participants learned spatial environments either by directly navigating them or by learning them from an aerial-view map. While undergoing functional magnetic resonance imaging (fMRI), participants then performed two different tasks to assess knowledge of the spatial environment: a scene- and orientation-dependent perceptual (SOP) pointing task and a judgment of relative direction (JRD) of landmarks pointing task. We found three brain regions showing significant effects of route vs. map learning during the two tasks. Parahippocampal and retrosplenial cortex showed greater activation following route compared to map learning during the JRD but not the SOP task, while inferior frontal gyrus showed greater activation following map compared to route learning during the SOP but not the JRD task. We interpret our results to suggest that parahippocampal and retrosplenial cortex were involved in translating scene- and orientation-dependent coordinate information acquired during route learning to a landmark-referenced representation, while inferior frontal gyrus played a role in converting primarily landmark-referenced coordinates acquired during map learning to a scene- and orientation-dependent coordinate system. Together, our results provide novel insight into the different brain networks underlying spatial representations formed during navigation vs. cartographic map learning and provide additional constraints on theoretical models of the neural basis of human spatial representation.
Deep learning investigation for chess player attention prediction using eye-tracking and game data
This article reports on an investigation of the use of convolutional neural
networks to predict the visual attention of chess players. The visual attention
model described in this article has been created to generate saliency maps that
capture hierarchical and spatial features of the chessboard, in order to predict
the fixation probability for individual pixels. Using a skip-layer architecture
of an autoencoder, with a unified decoder, we are able to use multiscale
features to predict the saliency of parts of the board at different scales, showing
multiple relations between pieces. We have used scan path and fixation data
from players engaged in solving chess problems, to compute 6600 saliency maps
associated with the corresponding chess piece configurations. This corpus is
supplemented with synthetically generated data from actual games gathered from an
online chess platform. Experiments conducted using both scan-paths from chess
players and the CAT2000 saliency dataset of natural images highlight several
results. Deep features, pretrained on natural images, were found to be helpful
in training visual attention prediction for chess. The proposed neural network
architecture is able to generate meaningful saliency maps on unseen chess
configurations with good scores on standard metrics. This work provides a
baseline for future work on visual attention prediction in similar contexts.
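To make the idea of a saliency map computed from fixation data more concrete, here is a minimal sketch of one common construction: each fixation contributes a Gaussian bump to a grid, and the grid is normalised so it can be read as a per-cell fixation probability. This is an illustrative assumption, not the authors' pipeline; the function name `fixation_saliency`, the grid size, and the `sigma` value are all hypothetical.

```python
import math
from typing import List, Tuple

def fixation_saliency(
    fixations: List[Tuple[float, float]],
    width: int,
    height: int,
    sigma: float = 1.0,
) -> List[List[float]]:
    """Accumulate a Gaussian around each (x, y) fixation, then normalise to sum to 1.

    Hypothetical helper for illustration only; not taken from the article.
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    if total > 0:
        # Normalise so each cell can be read as a fixation probability.
        grid = [[v / total for v in row] for row in grid]
    return grid

# Example: an 8x8 board-sized grid with two fixations (made-up coordinates).
saliency = fixation_saliency([(2, 3), (5, 5)], width=8, height=8, sigma=1.0)
```

A map of this kind could serve as a training target for a network such as the one described, although the article's abstract does not state how its maps were actually computed.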
Using film cutting in interface design
It has been suggested that computer interfaces could be made more usable if their designers utilized cinematography techniques, which have evolved to guide
the viewer through a narrative despite frequent discontinuities in the presented scene (i.e., cuts between shots). Because of differences between the domains of
film and interface design, it is not straightforward to understand how such techniques can be transferred. May and Barnard (1995) argued that a psychological
model of watching film could support such a transference. This article presents an extended account of this model, which allows identification of the practice of collocation
of objects of interest in the same screen position before and after a cut. To verify that filmmakers do, in fact, use such techniques successfully, eye movements
were measured while participants watched the entirety of a commerciall
Egocentric Spatial Representation in Action and Perception
Neuropsychological findings used to motivate the “two visual systems” hypothesis have been taken to endanger a pair of widely accepted claims about spatial representation in visual experience. The first is the claim that visual experience represents 3-D space around the perceiver using an egocentric frame of reference. The second is the claim that there is a constitutive link between the spatial contents of visual experience and the perceiver’s bodily actions. In this paper, I carefully assess three main sources of evidence for the two visual systems hypothesis and argue that the best interpretation of the evidence is in fact consistent with both claims. I conclude with some brief remarks on the relation between visual consciousness and rational agency.
Indexing of fictional video content for event detection and summarisation
This paper presents an approach to movie video indexing that utilises audiovisual analysis to detect important and meaningful temporal video segments, which we term events. We consider three event classes, corresponding to dialogues, action sequences, and montages, where the latter also includes musical sequences. These three event classes are intuitive for a viewer to understand and recognise whilst accounting for over 90% of the content of most movies. To detect events, we leverage traditional filmmaking principles and map these to a set of computable low-level audiovisual features. Finite state machines (FSMs) are used to detect when temporal sequences of specific features occur. A set of heuristics, again inspired by filmmaking conventions, is then applied to the output of multiple FSMs to detect the required events. A movie search system, named MovieBrowser, built upon this approach is also described. The overall approach is evaluated against a ground truth of over twenty-three hours of movie content drawn from various genres and consistently obtains high precision and recall for all event classes. A user experiment designed to evaluate the usefulness of an event-based structure for both searching and browsing movie archives is also described, and the results indicate the usefulness of the proposed approach.
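As a rough illustration of the FSM step described above, the sketch below runs a two-state machine over a sequence of per-shot feature labels and reports runs in which a target feature persists for a minimum number of shots. It is a simplified assumption of how such a detector might look, not the paper's implementation; the label names (`speech`, `music`, `high_motion`), the `min_len` threshold, and the function `detect_runs` are hypothetical.

```python
from typing import List, Tuple

def detect_runs(labels: List[str], target: str, min_len: int = 3) -> List[Tuple[int, int]]:
    """Two-state FSM (IDLE / IN_RUN) over shot labels.

    Returns (start, end) index pairs, end exclusive, for every run of
    `target` that lasts at least `min_len` shots. Hypothetical helper.
    """
    runs: List[Tuple[int, int]] = []
    state = "IDLE"
    start = 0
    for i, label in enumerate(labels):
        if state == "IDLE" and label == target:
            # Transition IDLE -> IN_RUN when the target feature appears.
            state, start = "IN_RUN", i
        elif state == "IN_RUN" and label != target:
            # Transition IN_RUN -> IDLE; keep the run only if long enough.
            if i - start >= min_len:
                runs.append((start, i))
            state = "IDLE"
    # Close a run that is still open at the end of the sequence.
    if state == "IN_RUN" and len(labels) - start >= min_len:
        runs.append((start, len(labels)))
    return runs

# Made-up shot labels for illustration.
shots = ["speech", "speech", "speech", "music", "high_motion", "high_motion", "speech"]
dialogue_runs = detect_runs(shots, "speech", min_len=3)
action_runs = detect_runs(shots, "high_motion", min_len=2)
```

In a fuller system, the outputs of several such machines would be combined by heuristics to label dialogue, action, and montage events, as the abstract outlines.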