Seeing What You're Told: Sentence-Guided Activity Recognition In Video
We present a system that demonstrates how the compositional structure of
events, in concert with the compositional structure of language, can interplay
with the underlying focusing mechanisms in video action recognition, thereby
providing a medium, not only for top-down and bottom-up integration, but also
for multi-modal integration between vision and language. We show how the roles
played by participants (nouns), their characteristics (adjectives), the actions
performed (verbs), the manner of such actions (adverbs), and changing spatial
relations between participants (prepositions) in the form of whole sentential
descriptions mediated by a grammar, guide the activity-recognition process.
Further, the utility and expressiveness of our framework are demonstrated by
performing three separate tasks in the domain of multi-activity videos:
sentence-guided focus of attention, generation of sentential descriptions of
video, and query-based video search, simply by leveraging the framework in
different manners.
Comment: To appear in CVPR 201