5,932 research outputs found

    Objects2action: Classifying and localizing actions without any video example

    Get PDF
    The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach

    Object Referring in Videos with Language and Human Gaze

    Full text link
    We investigate the problem of object referring (OR) i.e. to localize a target object in a visual scene coming with a language description. Humans perceive the world more as continued video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works for OR mostly focus on static images only, which fall short in providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30, 000 objects over 5, 000 stereo video sequences annotated for their descriptions and gaze. We further propose a novel network model for OR in videos, by integrating appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context. Our method outperforms previousOR methods. For dataset and code, please refer https://people.ee.ethz.ch/~arunv/ORGaze.html.Comment: Accepted to CVPR 2018, 10 pages, 6 figure

    A study in the cognition of individuals’ identity: Solving the problem of singular cognition in object and agent tracking

    Get PDF
    This article compares the ability to track individuals lacking mental states with the ability to track intentional agents. It explains why reference to individuals raises the problem of explaining how cognitive agents track unique individuals and in what sense reference is based on procedures of perceptual-motor and epistemic tracking. We suggest applying the notion of singular-files from theories in perception and semantics to the problem of tracking intentional agents. In order to elucidate the nature of agent-files, three views of the relation between object- and agent-tracking are distinguished: the Independence, Deflationary and Organism-Dependence Views. The correct view is argued to be the latter, which states that perceptual and epistemic tracking of a unique human organism requires tracking both its spatio-temporal object-properties and its agent-properties
    • …
    corecore