
    Going Deeper with Semantics: Video Activity Interpretation using Semantic Contextualization

    A deeper understanding of video activities extends beyond recognition of underlying concepts such as actions and objects: constructing deep semantic representations requires reasoning about the semantic relationships among these concepts, often beyond what is directly observed in the data. To this end, we propose an energy minimization framework that leverages large-scale commonsense knowledge bases, such as ConceptNet, to provide contextual cues for establishing semantic relationships among entities directly hypothesized from the video signal. We express this mathematically using the language of Grenander's canonical pattern generator theory. We show that the use of prior encoded commonsense knowledge alleviates the need for large annotated training datasets and helps tackle imbalance in training. Using three publicly available datasets (Charades, Microsoft Visual Description Corpus, and Breakfast Actions), we show that the proposed model can generate video interpretations whose quality is better than those reported by state-of-the-art approaches, which have substantial training needs. Through extensive experiments, we show that the use of commonsense knowledge from ConceptNet allows the proposed approach to handle various challenges such as training data imbalance, weak features, and complex semantic relationships and visual scenes. Comment: Accepted to WACV 201

    Differential recruitment of brain networks following route and cartographic map learning of spatial environments.

    An extensive neuroimaging literature has helped characterize the brain regions involved in navigating a spatial environment. Far less is known, however, about the brain networks involved when learning a spatial layout from a cartographic map. To compare the two means of acquiring a spatial representation, participants learned spatial environments either by directly navigating them or learning them from an aerial-view map. While undergoing functional magnetic resonance imaging (fMRI), participants then performed two different tasks to assess knowledge of the spatial environment: a scene and orientation dependent perceptual (SOP) pointing task and a judgment of relative direction (JRD) of landmarks pointing task. We found three brain regions showing significant effects of route vs. map learning during the two tasks. Parahippocampal and retrosplenial cortex showed greater activation following route compared to map learning during the JRD but not SOP task while inferior frontal gyrus showed greater activation following map compared to route learning during the SOP but not JRD task. We interpret our results to suggest that parahippocampal and retrosplenial cortex were involved in translating scene and orientation dependent coordinate information acquired during route learning to a landmark-referenced representation while inferior frontal gyrus played a role in converting primarily landmark-referenced coordinates acquired during map learning to a scene and orientation dependent coordinate system. Together, our results provide novel insight into the different brain networks underlying spatial representations formed during navigation vs. cartographic map learning and provide additional constraints on theoretical models of the neural basis of human spatial representation

    Deep learning investigation for chess player attention prediction using eye-tracking and game data

    This article reports on an investigation of the use of convolutional neural networks to predict the visual attention of chess players. The visual attention model described in this article has been created to generate saliency maps that capture hierarchical and spatial features of the chessboard, in order to predict the fixation probability for individual pixels. Using a skip-layer autoencoder architecture with a unified decoder, we are able to use multiscale features to predict the saliency of parts of the board at different scales, showing multiple relations between pieces. We have used scan path and fixation data from players engaged in solving chess problems to compute 6600 saliency maps associated with the corresponding chess piece configurations. This corpus is supplemented with synthetically generated data from actual games gathered from an online chess platform. Experiments conducted using both scan paths from chess players and the CAT2000 saliency dataset of natural images highlight several results. Deep features, pretrained on natural images, were found to be helpful in training visual attention prediction for chess. The proposed neural network architecture is able to generate meaningful saliency maps on unseen chess configurations with good scores on standard metrics. This work provides a baseline for future work on visual attention prediction in similar contexts

    Using film cutting in interface design

    It has been suggested that computer interfaces could be made more usable if their designers utilized cinematography techniques, which have evolved to guide the viewer through a narrative despite frequent discontinuities in the presented scene (i.e., cuts between shots). Because of differences between the domains of film and interface design, it is not straightforward to understand how such techniques can be transferred. May and Barnard (1995) argued that a psychological model of watching film could support such a transference. This article presents an extended account of this model, which allows identification of the practice of collocation of objects of interest in the same screen position before and after a cut. To verify that filmmakers do, in fact, use such techniques successfully, eye movements were measured while participants watched the entirety of a commerciall

    Egocentric Spatial Representation in Action and Perception

    Neuropsychological findings used to motivate the “two visual systems” hypothesis have been taken to endanger a pair of widely accepted claims about spatial representation in visual experience. The first is the claim that visual experience represents 3-D space around the perceiver using an egocentric frame of reference. The second is the claim that there is a constitutive link between the spatial contents of visual experience and the perceiver’s bodily actions. In this paper, I carefully assess three main sources of evidence for the two visual systems hypothesis and argue that the best interpretation of the evidence is in fact consistent with both claims. I conclude with some brief remarks on the relation between visual consciousness and rational agency

    Indexing of fictional video content for event detection and summarisation

    This paper presents an approach to movie video indexing that utilises audiovisual analysis to detect important and meaningful temporal video segments, which we term events. We consider three event classes, corresponding to dialogues, action sequences, and montages, where the latter also includes musical sequences. These three event classes are intuitive for a viewer to understand and recognise whilst accounting for over 90% of the content of most movies. To detect events we leverage traditional filmmaking principles and map these to a set of computable low-level audiovisual features. Finite state machines (FSMs) are used to detect when temporal sequences of specific features occur. A set of heuristics, again inspired by filmmaking conventions, is then applied to the output of multiple FSMs to detect the required events. A movie search system, named MovieBrowser, built upon this approach is also described. The overall approach is evaluated against a ground truth of over twenty-three hours of movie content drawn from various genres and consistently obtains high precision and recall for all event classes. A user experiment designed to evaluate the usefulness of an event-based structure for both searching and browsing movie archives is also described, and the results indicate the usefulness of the proposed approach
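
The FSM step in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-shot boolean feature, the two states, the function name `detect_segments`, and the `min_len` threshold are all assumptions chosen for illustration. The machine scans a sequence of per-shot flags (for example, "high motion detected") and reports runs that last long enough to count as a candidate segment.

```python
from typing import List, Tuple


def detect_segments(flags: List[bool], min_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for runs of True flags
    that last at least min_len items.

    Hypothetical two-state machine: IDLE (no run in progress) and
    ACTIVE (inside a run of True flags).
    """
    IDLE, ACTIVE = "idle", "active"
    state = IDLE
    start = 0
    segments: List[Tuple[int, int]] = []

    for i, flag in enumerate(flags):
        if state == IDLE and flag:
            # A run begins: remember where it started.
            state, start = ACTIVE, i
        elif state == ACTIVE and not flag:
            # The run ended at i; keep it only if it is long enough.
            if i - start >= min_len:
                segments.append((start, i))
            state = IDLE

    # Close a run that is still open at the end of the sequence.
    if state == ACTIVE and len(flags) - start >= min_len:
        segments.append((start, len(flags)))
    return segments


if __name__ == "__main__":
    # Per-shot flags for a made-up feature such as "high motion".
    shots = [False, True, True, True, False, True, False, True, True, True, True]
    print(detect_segments(shots, min_len=3))
```

In a fuller system, the outputs of several such machines (one per feature) would be combined by heuristics, as the abstract describes; that combination step is not sketched here.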