Multilevel Language and Vision Integration for Text-to-Clip Retrieval
We address the problem of text-based activity retrieval in video. Given a
sentence describing an activity, our task is to retrieve matching clips from an
untrimmed video. To capture the inherent structures present in both text and
video, we introduce a multilevel model that integrates vision and language
features earlier and more tightly than prior work. First, we inject text
features early on when generating clip proposals, to help eliminate unlikely
clips and thus speed up processing and boost performance. Second, to learn a
fine-grained similarity metric for retrieval, we use visual features to
modulate the processing of query sentences at the word level in a recurrent
neural network. A multi-task loss is also employed by adding query
re-generation as an auxiliary task. Our approach significantly outperforms
prior work on two challenging benchmarks: Charades-STA and ActivityNet
Captions.

Comment: AAAI 2019
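The word-level modulation idea above can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's architecture: it uses a plain tanh RNN and a single sigmoid gate projected from a pooled clip feature, whereas the actual model's recurrent cell and modulation form are not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def visually_modulated_rnn(word_embs, visual_feat, W_h, W_x, W_g):
    """Toy sketch: a simple tanh RNN over word embeddings, where each
    word embedding is gated element-wise by a projection of the clip's
    pooled visual feature before the recurrent update.

    word_embs:   (T, d) one embedding per word
    visual_feat: (d_v,) pooled clip feature
    """
    gate = 1.0 / (1.0 + np.exp(-(W_g @ visual_feat)))  # (d,) sigmoid gate from vision
    h = np.zeros(W_h.shape[0])
    for x in word_embs:
        x_mod = gate * x                    # visual modulation at the word level
        h = np.tanh(W_h @ h + W_x @ x_mod)  # recurrent update
    return h  # sentence encoding, conditioned on the clip

# Hypothetical shapes for illustration only.
d, d_v, T, d_h = 8, 6, 5, 10
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d)) * 0.1
W_g = rng.normal(size=(d, d_v)) * 0.1
embs = rng.normal(size=(T, d))
vfeat = rng.normal(size=(d_v,))
enc = visually_modulated_rnn(embs, vfeat, W_h, W_x, W_g)
```

Two different clips thus yield two different encodings of the same sentence, which is what lets the similarity metric be fine-grained.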
Person Search with Natural Language Description
Searching for persons in large-scale image databases with natural language
description queries has important applications in video surveillance. Existing
methods have mainly focused on searching for persons with image-based or
attribute-based queries, which have major limitations in practice. In this paper, we
study the problem of person search with natural language description. Given the
textual description of a person, the algorithm is required to rank all samples
in the person database and then retrieve the most relevant sample corresponding
to the queried description. Since there is no person
dataset or benchmark with textual description available, we collect a
large-scale person description dataset with detailed natural language
annotations and person samples from various sources, termed as CUHK Person
Description Dataset (CUHK-PEDES). A wide range of possible models and baselines
have been evaluated and compared on the person search benchmark. A Recurrent
Neural Network with a Gated Neural Attention mechanism (GNA-RNN) is proposed to
establish the state-of-the-art performance on person search.
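To make the gated-attention idea concrete, here is a minimal sketch of one plausible reading (the function name, shapes, and gating form are assumptions, not the GNA-RNN's actual design): each word attends over a set of visual "units", and those attention weights are gated by the units' activations on the image before being pooled into a text-image affinity score.

```python
import numpy as np

def gna_similarity(word_feats, unit_acts, W_attn):
    """Toy gated-attention affinity between a description and a person image.

    word_feats: (T, d) per-word features
    unit_acts:  (K,)   visual unit activations for the image
    W_attn:     (d, K) word-to-unit attention projection
    """
    logits = word_feats @ W_attn                                  # (T, K)
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax per word
    gated = attn * unit_acts          # gate each word's attention by unit response
    return gated.sum()                # scalar affinity used for ranking

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(1)
T, d, K = 4, 6, 3
W_attn = rng.normal(size=(d, K))
words = rng.normal(size=(T, d))
units = rng.uniform(size=K)
affinity = gna_similarity(words, units, W_attn)
```

Ranking a database then amounts to computing this affinity for every gallery image and sorting.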
Elimination of Bias in Introspection: Methodological Advances, Refinements, and Recommendations
Building on past constructive criticism, the present study provides further methodological development focused on the elimination of bias that may occur during first-person observation. First, various sources of error that may accompany introspection are distinguished based on previous critical literature. Four main errors are classified, namely attentional, attributional, conceptual, and expressional errors. Furthermore, methodological recommendations for the possible elimination of these errors have been determined based on the analysis and focused excerpting of introspective scientific literature. The following groups of methodological recommendations were determined: 1) better focusing of the subject's attention on their mental processes, 2) providing suitable stimuli, and 3) the sharing of introspective experience between subjects. Finally, the potential of adjustments in introspective research designs for eliminating attentional, attributional, conceptual, and expressional errors is discussed.
Seeing What You're Told: Sentence-Guided Activity Recognition In Video
We present a system that demonstrates how the compositional structure of
events, in concert with the compositional structure of language, can interplay
with the underlying focusing mechanisms in video action recognition, thereby
providing a medium, not only for top-down and bottom-up integration, but also
for multi-modal integration between vision and language. We show how the roles
played by participants (nouns), their characteristics (adjectives), the actions
performed (verbs), the manner of such actions (adverbs), and changing spatial
relations between participants (prepositions) in the form of whole sentential
descriptions mediated by a grammar, guide the activity-recognition process.
Further, the utility and expressiveness of our framework are demonstrated by
performing three separate tasks in the domain of multi-activity videos:
sentence-guided focus of attention, generation of sentential descriptions of
video, and query-based video search, simply by leveraging the framework in
different ways.

Comment: To appear in CVPR 2014
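The core idea that whole-sentence structure guides which participants are attended can be sketched in a few lines of plain Python. This toy is an assumption, not the paper's grammar- and tracker-based machinery: it supposes each word contributes a log-score for every candidate object track, and the sentence selects the track assignment that best explains all words jointly.

```python
import itertools
import math

def sentence_guided_assignment(word_detectors, tracks):
    """Pick the assignment of candidate tracks to sentence roles that
    maximizes the summed per-word detector log-scores.

    word_detectors: one dict per role word, mapping track id -> log-score
    tracks:         list of candidate track ids
    """
    best_score, best_assign = -math.inf, None
    for assign in itertools.permutations(tracks, len(word_detectors)):
        score = sum(det.get(t, -math.inf)
                    for det, t in zip(word_detectors, assign))
        if score > best_score:
            best_score, best_assign = score, assign
    return best_assign, best_score

# Toy example: a "person" detector favors track t1, a "ball" detector favors t2.
detectors = [{"t1": -0.1, "t2": -2.0},   # log-scores under the "person" word
             {"t1": -2.0, "t2": -0.1}]   # log-scores under the "ball" word
assign, score = sentence_guided_assignment(detectors, ["t1", "t2"])
```

Jointly scoring the whole assignment, rather than each word independently, is what lets the sentence act as a top-down focus-of-attention signal.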