Story-based Video Retrieval in TV series using Plot Synopses
We present a novel approach to search for plots in the storyline of structured videos such as TV series. To this end, we propose to align natural language descriptions of the videos, such as plot synopses, with the corresponding shots in the video. Guided by subtitles and person identities, the alignment problem is formulated as an optimization task over all possible assignments and solved efficiently using dynamic programming. We evaluate our approach on a novel dataset comprising the complete season 5 of Buffy the Vampire Slayer, and show good alignment performance and the ability to retrieve plots in the storyline.
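As a rough illustration of how an alignment "over all possible assignments" can be solved with dynamic programming, the sketch below assigns each shot to one synopsis sentence under a monotonicity constraint. It is not the authors' formulation: the function name `align_shots_to_sentences`, the similarity matrix `sim`, and the constraint that every sentence receives at least one shot are assumptions made for this example. In the paper, the scores are guided by subtitles and person identities.

```python
def align_shots_to_sentences(sim):
    """Monotonic alignment of shots to sentences by dynamic programming.

    sim[i][j] is a hypothetical similarity between sentence i and shot j.
    Each shot gets exactly one sentence, sentence indices never decrease
    along the video, and every sentence receives at least one shot.
    Returns a list mapping each shot index to a sentence index.
    """
    n_sent, n_shot = len(sim), len(sim[0])
    if n_shot < n_sent:
        raise ValueError("need at least as many shots as sentences")

    neg_inf = float("-inf")
    # best[i][j]: best total score when shot j is assigned to sentence i
    best = [[neg_inf] * n_shot for _ in range(n_sent)]
    # back[i][j]: sentence assigned to shot j-1 on that best path
    back = [[0] * n_shot for _ in range(n_sent)]
    best[0][0] = sim[0][0]

    for j in range(1, n_shot):
        for i in range(min(j + 1, n_sent)):
            stay = best[i][j - 1]                          # same sentence
            advance = best[i - 1][j - 1] if i > 0 else neg_inf  # next sentence
            if advance > stay:
                best[i][j] = advance + sim[i][j]
                back[i][j] = i - 1
            else:
                best[i][j] = stay + sim[i][j]
                back[i][j] = i

    # Backtrack from the last sentence at the last shot.
    assignment = [0] * n_shot
    i = n_sent - 1
    for j in range(n_shot - 1, -1, -1):
        assignment[j] = i
        i = back[i][j]
    return assignment


# Toy scores: two sentences, four shots.
toy_sim = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.2, 0.9, 0.7],
]
print(align_shots_to_sentences(toy_sim))
```

The table has one cell per (sentence, shot) pair, so the search over exponentially many assignments costs only time proportional to the number of sentences times the number of shots.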
A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)
The alignment of heterogeneous sequential data (video to text) is an
important and challenging problem. Standard techniques for this task, including
Dynamic Time Warping (DTW) and Conditional Random Fields (CRFs), suffer from
inherent drawbacks. Mainly, the Markov assumption implies that, given the
immediate past, future alignment decisions are independent of further history.
The separation between similarity computation and alignment decision also
prevents end-to-end training. In this paper, we propose an end-to-end neural
architecture where alignment actions are implemented as moving data between
stacks of Long Short-term Memory (LSTM) blocks. This flexible architecture
supports a large variety of alignment tasks, including one-to-one, one-to-many,
skipping unmatched elements, and (with extensions) non-monotonic alignment.
Extensive experiments on semi-synthetic and real datasets show that our
algorithm outperforms state-of-the-art baselines.
Comment: Accepted at CVPR 2018 (Spotlight). The arXiv file includes the paper and
the supplemental material.
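For context on the baselines named above, here is a minimal sketch of classic Dynamic Time Warping, not of NeuMATCH itself. The function name `dtw` and the absolute-difference distance are assumptions made for this example. Note that each cell depends only on its three immediate predecessors, and that the distance is computed separately from the alignment recursion; these are the two properties the abstract identifies as drawbacks.

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW between sequences a and b.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    aligning a[i] with b[j]. The distance function is fixed in advance and
    is not learned together with the alignment.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Only the three neighbouring cells are consulted.
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both
                                 cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1])      # repeat a[i-1]

    # Backtrack along the cheapest predecessors to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


total, path = dtw([1, 2, 3], [1, 2, 2, 3])
print(total, path)
```

Plain DTW as written matches every element of both sequences, so it cannot skip unmatched elements or produce non-monotonic alignments, which are among the cases the proposed architecture is said to support.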
Contextual Person Identification in Multimedia Data
We propose methods to improve automatic person identification, regardless of the visibility of a face, by integrating multiple cues, including multiple modalities and contextual information. We propose a joint learning approach that uses contextual information from videos to improve learned face models. Further, we integrate additional modalities in a global fusion framework. We evaluate our approaches on a novel TV series data set consisting of over 100 000 annotated faces.