93 research outputs found

    Story-based Video Retrieval in TV series using Plot Synopses

    We present a novel approach to search for plots in the storyline of structured videos such as TV series. To this end, we propose to align natural language descriptions of the videos, such as plot synopses, with the corresponding shots in the video. Guided by subtitles and person identities, the alignment problem is formulated as an optimization task over all possible assignments and solved efficiently using dynamic programming. We evaluate our approach on a novel dataset comprising the complete season 5 of Buffy the Vampire Slayer, and show good alignment performance and the ability to retrieve plots in the storyline.
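
The abstract above frames synopsis-to-shot alignment as an optimization over assignments solved with dynamic programming. A minimal sketch of that general idea follows; it is not the authors' formulation. It assumes a precomputed sentence-by-shot similarity matrix `sim` (hypothetical, standing in for cues such as subtitles and person identities), requires the assignment to be monotonic in time, and gives every sentence at least one shot. The function name `align_shots_to_sentences` is illustrative only.

```python
import numpy as np


def align_shots_to_sentences(sim):
    """Assign each shot to one synopsis sentence, monotonically in time.

    sim[i, j] is an assumed (hypothetical) similarity between sentence i
    and shot j. The result lists one sentence index per shot, is
    non-decreasing, starts at sentence 0, ends at the last sentence, and
    maximizes the summed similarity under those constraints.
    """
    sim = np.asarray(sim, dtype=float)
    n_sent, n_shot = sim.shape
    if n_shot < n_sent:
        raise ValueError("need at least one shot per sentence")

    best = np.full((n_sent, n_shot), -np.inf)    # best score ending at (i, j)
    back = np.zeros((n_sent, n_shot), dtype=int)  # 1 if shot j starts sentence i
    best[0, 0] = sim[0, 0]

    for j in range(1, n_shot):
        # Sentence i cannot be reached before shot i, hence the upper bound.
        for i in range(min(j + 1, n_sent)):
            stay = best[i, j - 1]
            advance = best[i - 1, j - 1] if i > 0 else -np.inf
            if advance > stay:
                best[i, j] = advance + sim[i, j]
                back[i, j] = 1
            else:
                best[i, j] = stay + sim[i, j]

    # Walk back from the last sentence at the last shot.
    assignment = [0] * n_shot
    i = n_sent - 1
    for j in range(n_shot - 1, -1, -1):
        assignment[j] = i
        i -= back[i, j]
    return assignment


example_sim = [
    [0.9, 0.8, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.9, 0.7, 0.2],
    [0.0, 0.1, 0.2, 0.3, 0.8],
]
example_alignment = align_shots_to_sentences(example_sim)
```

The table has one cell per (sentence, shot) pair, so the search over all monotonic assignments costs time proportional to the number of sentences times the number of shots rather than growing exponentially.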

    A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)

    The alignment of heterogeneous sequential data (video to text) is an important and challenging problem. Standard techniques for this task, including Dynamic Time Warping (DTW) and Conditional Random Fields (CRFs), suffer from inherent drawbacks. Mainly, the Markov assumption implies that, given the immediate past, future alignment decisions are independent of further history. The separation between similarity computation and alignment decision also prevents end-to-end training. In this paper, we propose an end-to-end neural architecture where alignment actions are implemented as moving data between stacks of Long Short-term Memory (LSTM) blocks. This flexible architecture supports a large variety of alignment tasks, including one-to-one, one-to-many, skipping unmatched elements, and (with extensions) non-monotonic alignment. Extensive experiments on semi-synthetic and real datasets show that our algorithm outperforms state-of-the-art baselines. Comment: Accepted at CVPR 2018 (Spotlight). arXiv file includes the paper and the supplemental material.
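
For readers unfamiliar with the baseline named in this abstract, here is a minimal sketch of textbook Dynamic Time Warping on two 1-D sequences. It is not the NeuMATCH model. The function name `dtw_distance` and the absolute-difference local cost are assumptions chosen for illustration. Each cell is computed from only its three neighbouring cells, which is the Markov-style limitation the abstract refers to, and the local cost is fixed in advance rather than learned jointly with the alignment.

```python
import numpy as np


def dtw_distance(x, y):
    """Classic DTW cost between two 1-D sequences (illustrative sketch).

    cost[i, j] is the cheapest warping cost aligning x[:i] with y[:j].
    It depends only on the three neighbouring cells, so the decision at
    (i, j) ignores any longer alignment history.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # fixed, hand-picked similarity
            cost[i, j] = local + min(
                cost[i - 1, j],      # x[i-1] also matches an earlier y
                cost[i, j - 1],      # y[j-1] also matches an earlier x
                cost[i - 1, j - 1],  # one-to-one step
            )
    return float(cost[n, m])


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```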

    Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

    Books are a rich source of both fine-grained information (what a character, an object or a scene looks like) and high-level semantics (what someone is thinking and feeling, and how these states evolve through a story). This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in the current datasets. To align movies and books we propose a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for. Funding: Natural Sciences and Engineering Research Council of Canada; Canadian Institute for Advanced Research; Samsung (Firm); Google (Firm); United States. Office of Naval Research (ONR-N00014-14-1-0232)

    MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

    We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding. Despite the notable progress that has been witnessed in the realm of video understanding, most prior works fail to present tasks and models to address holistic video understanding and the innate visual narrative structures existing in long-form videos. To tackle this quandary, we put forth the MoviePuzzle task, which amplifies the temporal feature learning and structure learning of video models by reshuffling the shot, frame, and clip layers of movie segments in the presence of video-dialogue information. We start by establishing a carefully refined dataset based on MovieNet by dissecting movies into hierarchical layers and randomly permuting the orders. Besides benchmarking MoviePuzzle with prior art on movie understanding, we devise a Hierarchical Contrastive Movie Clustering (HCMC) model that considers the underlying structure and visual semantic orders for movie reordering. Specifically, through a pairwise and contrastive learning approach, we train models to predict the correct order of each layer. This equips them with the knack for deciphering the visual narrative structure of movies and handling the disorder lurking in video data. Experiments show that our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark, underscoring its efficacy.

    MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation

    Previous research has studied the task of segmenting cinematic videos into scenes and into narrative acts. However, these studies have overlooked the essential task of multimodal alignment and fusion for effectively and efficiently processing long-form videos (>60min). In this paper, we introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation. MEGA tackles the challenge by leveraging multiple media modalities. The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding. To maintain temporal synchronization while reducing computation, we further introduce an enhanced bottleneck fusion layer which uses temporal alignment. Additionally, MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots. Our experimental results show that MEGA outperforms state-of-the-art methods on the MovieNet dataset for scene segmentation (with an Average Precision improvement of +1.19%) and on the TRIPOD dataset for act segmentation (with a Total Agreement improvement of +5.51%). Comment: ICCV 2023 accepted

    Structure-aware narrative summarization from multiple views

    Narratives, such as movies and TV shows, provide a testbed for addressing a variety of challenges in the field of artificial intelligence. They are examples of complex stories where characters and events interact in many ways. Inferring what is happening in a narrative requires modeling long-range dependencies between events, understanding commonsense knowledge and accounting for non-linearities in the presentation of the story. Moreover, narratives are usually long (i.e., there are hundreds of pages in a screenplay and thousands of frames in a video) and cannot be easily processed by standard neural architectures. Movies and TV episodes also include information from multiple sources (i.e., video, audio, text) that are complementary to inferring high-level events and their interactions. Finally, creating large-scale multimodal datasets with narratives containing long videos and aligned textual data is challenging, resulting in small datasets that require data-efficient approaches. Most prior work that analyzes narratives does not consider the above challenges all at once. In most cases, text-only approaches focus on full-length narratives with complex semantics and address tasks such as question-answering and summarization, or multimodal approaches are limited to short videos with simpler semantics (e.g., isolated actions and local interactions). In this thesis, we combine these two different directions in addressing narrative summarization. We use all input modalities (i.e., video, audio, text), consider full-length narratives and perform the task of narrative summarization both in a video-to-video setting (i.e., video summarization, trailer generation) and a video-to-text setting (i.e., multimodal abstractive summarization). We hypothesize that information about the narrative structure of movies and TV episodes can facilitate summarizing them.
    We introduce the task of Turning Point identification and provide a corresponding dataset called TRIPOD as a means of analyzing the narrative structure of movies. According to screenwriting theory, turning points (e.g., change of plans, major setback, climax) are crucial narrative moments within a movie or TV episode: they define the plot structure and determine its progression and thematic units. We validate that narrative structure contributes to extractive screenplay summarization by testing our hypothesis on a dataset containing TV episodes and summary-specific labels. We further hypothesize that movies should not be viewed as a sequence of scenes from a screenplay or shots from a video and instead be modelled as sparse graphs, where nodes are scenes or shots and edges denote strong semantic relationships between them. We utilize multimodal information for creating movie graphs in the latent space, and find that both graph-related and multimodal information help contextualization and boost performance on extractive summarization. Moving one step further, we also address the task of trailer moment identification, which can be viewed as a specific instantiation of narrative summarization. We decompose this task, which is challenging and subjective, into two simpler ones: narrative structure identification, defined again by turning points, and sentiment prediction. We propose a graph-based unsupervised algorithm that uses interpretable criteria for retrieving trailer shots and convert it into an interactive tool with a human in the loop for trailer creation. Semi-automatic trailer shot selection exhibits comparable performance to fully manual selection according to human judges, while minimizing processing time. After identifying salient content in narratives, we next attempt to produce abstractive textual summaries (i.e., video-to-text).
    We hypothesize that multimodal information is directly important for generating textual summaries, apart from contributing to content selection. For that, we propose a parameter-efficient way of incorporating multimodal information into a pre-trained textual summarizer, while training only 3.8% of model parameters, and demonstrate the importance of multimodal information for generating high-quality and factual summaries. The findings of this thesis underline the need to focus on realistic and multimodal settings when addressing narrative analysis and generation tasks.

    AutoAD: Movie Description in Context

    The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation. In order to obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, as well as the subtitles; (ii) we address the lack of training data by pretraining on large-scale datasets, where visual or contextual information is unavailable, e.g. text-only AD without movies or visual captioning datasets without context; (iii) we improve on the currently available AD datasets, by removing label noise in the MAD dataset, and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods. Comment: CVPR 2023 Highlight. Project page: https://www.robots.ox.ac.uk/~vgg/research/autoad