114,956 research outputs found
Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos
We propose a new zero-shot Event Detection method by Multi-modal
Distributional Semantic embedding of videos. Our model embeds object and action
concepts as well as other available modalities from videos into a
distributional semantic space. To our knowledge, this is the first Zero-Shot
event detection model that is built on top of distributional semantics and
extends it in the following directions: (a) semantic embedding of multimodal
information in videos (with focus on the visual modalities), (b) automatically
determining relevance of concepts/attributes to a free text query, which could
be useful for other applications, and (c) retrieving videos by free text event
query (e.g., "changing a vehicle tire") based on their content. We embed videos
into a distributional semantic space and then measure the similarity between
videos and the event query in a free text form. We validated our method on the
large TRECVID MED (Multimedia Event Detection) challenge. Using only the event
title as a query, our method outperformed the state-of-the-art that uses big
descriptions from 12.6% to 13.5% with MAP metric and 0.73 to 0.83 with ROC-AUC
metric. It is also an order of magnitude faster.Comment: To appear in AAAI 201
Taking the bite out of automated naming of characters in TV video
We investigate the problem of automatically labelling appearances of characters in TV or film material
with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying
when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series ââBuffy the Vampire Slayerâ
Learning to Localize and Align Fine-Grained Actions to Sparse Instructions
Automatic generation of textual video descriptions that are time-aligned with
video content is a long-standing goal in computer vision. The task is
challenging due to the difficulty of bridging the semantic gap between the
visual and natural language domains. This paper addresses the task of
automatically generating an alignment between a set of instructions and a first
person video demonstrating an activity. The sparse descriptions and ambiguity
of written instructions create significant alignment challenges. The key to our
approach is the use of egocentric cues to generate a concise set of action
proposals, which are then matched to recipe steps using object recognition and
computational linguistic techniques. We obtain promising results on both the
Extended GTEA Gaze+ dataset and the Bristol Egocentric Object Interactions
Dataset
- âŠ