5 research outputs found

    Story-based Video Retrieval in TV series using Plot Synopses

    Full text link
    We present a novel approach to search for plots in the story-line of structured videos such as TV series. To this end, we propose to align natural language descriptions of the videos, such as plot synopses, with the corresponding shots in the video. Guided by subtitles and person identities the align-ment problem is formulated as an optimization task over all possible assignments and solved efficiently using dynamic programming. We evaluate our approach on a novel dataset comprising of the complete season 5 of Buffy the Vampire Slayer, and show good alignment performance and the abil-ity to retrieve plots in the storyline

    A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)

    Full text link
    The alignment of heterogeneous sequential data (video to text) is an important and challenging problem. Standard techniques for this task, including Dynamic Time Warping (DTW) and Conditional Random Fields (CRFs), suffer from inherent drawbacks. Mainly, the Markov assumption implies that, given the immediate past, future alignment decisions are independent of further history. The separation between similarity computation and alignment decision also prevents end-to-end training. In this paper, we propose an end-to-end neural architecture where alignment actions are implemented as moving data between stacks of Long Short-term Memory (LSTM) blocks. This flexible architecture supports a large variety of alignment tasks, including one-to-one, one-to-many, skipping unmatched elements, and (with extensions) non-monotonic alignment. Extensive experiments on semi-synthetic and real datasets show that our algorithm outperforms state-of-the-art baselines.Comment: Accepted at CVPR 2018 (Spotlight). arXiv file includes the paper and the supplemental materia

    Contextual Person Identification in Multimedia Data

    Get PDF
    We propose methods to improve automatic person identification, regardless of the visibility of a face, by integration of multiple cues including multiple modalities and contextual information. We propose a joint learning approach using contextual information from videos to improve learned face models. Further, we integrate additional modalities in a global fusion framework. We evaluate our approaches on a novel TV series data set, consisting of over 100 000 annotated faces
    corecore