111,285 research outputs found

    Multilevel Language and Vision Integration for Text-to-Clip Retrieval

    Full text link
    We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions.Comment: AAAI 201

    Person Search with Natural Language Description

    Full text link
    Searching persons in large-scale image databases with the query of natural language description has important applications in video surveillance. Existing methods mainly focused on searching persons with image-based or attribute-based queries, which have major limitations for a practical usage. In this paper, we study the problem of person search with natural language description. Given the textual description of a person, the algorithm of the person search is required to rank all the samples in the person database then retrieve the most relevant sample corresponding to the queried description. Since there is no person dataset or benchmark with textual description available, we collect a large-scale person description dataset with detailed natural language annotations and person samples from various sources, termed as CUHK Person Description Dataset (CUHK-PEDES). A wide range of possible models and baselines have been evaluated and compared on the person search benchmark. An Recurrent Neural Network with Gated Neural Attention mechanism (GNA-RNN) is proposed to establish the state-of-the art performance on person search

    Elimination of Bias in Introspection: Methodological Advances, Refinements, and Recommendations

    Get PDF
    Building on past constructive criticism, the present study provides further methodological development focused on the elimination of bias that may occur during first-person observation. First, various sources of errors that may accompany introspection are distinguished based on previous critical literature. Four main errors are classified, namely attentional, attributional, conceptual, and expressional error. Furthermore, methodological recommendations for the possible elimination of these errors have been determined based on the analysis and focused excerpting of introspective scientific literature. The following groups of methodological recommendations were determined: 1) a better focusing of the subject’s attention to their mental processes, 2) providing suitable stimuli, and 3) the sharing of introspective experience between subjects. Furthermore, the potential of adjustments in introspective research designs for eliminating attentional, attributional, conceptual, and expressional error is discussed

    Seeing What You're Told: Sentence-Guided Activity Recognition In Video

    Get PDF
    We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium, not only for top-down and bottom-up integration, but also for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions) in the form of whole sentential descriptions mediated by a grammar, guides the activity-recognition process. Further, the utility and expressiveness of our framework is demonstrated by performing three separate tasks in the domain of multi-activity videos: sentence-guided focus of attention, generation of sentential descriptions of video, and query-based video search, simply by leveraging the framework in different manners.Comment: To appear in CVPR 201
    • …
    corecore