690 research outputs found
Movie/Script: Alignment and Parsing of Video and Text Transcription
Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales "in the wild". Harvesting automatically labeled sequences of actions from video would enable creation of large-scale and highly-varied datasets. To enable such collection, we focus on the task of recovering scene structure in movies and TV series for object tracking and action retrieval. We present a weakly supervised algorithm that uses the screenplay and closed captions to parse a movie into a hierarchy of shots and scenes. Scene boundaries in the movie are aligned with screenplay scene labels and shots are reordered into a sequence of long continuous tracks or threads which allow for more accurate tracking of people, actions and objects. Scene segmentation, alignment, and shot threading are formulated as inference in a unified generative model and a novel hierarchical dynamic programming algorithm that can handle alignment and jump-limited reorderings in linear time is presented. We present quantitative and qualitative results on movie alignment and parsing, and use the recovered structure to improve character naming and retrieval of common actions in several episodes of popular TV series.
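The paper's unified generative model and jump-limited reordering are more involved than can be shown here, but the core alignment step can be illustrated with a toy monotone dynamic program that matches screenplay scene texts to caption blocks. All data, the word-overlap similarity, and the gap penalty below are illustrative assumptions, not the paper's actual model.

```python
# Toy sketch: monotone alignment of screenplay scenes to caption blocks
# via Needleman-Wunsch-style dynamic programming.

def overlap(a, b):
    """Word-overlap (Jaccard) similarity between two text snippets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def align(scenes, captions, gap=-0.1):
    """Return matched (scene, caption) index pairs from the best alignment."""
    n, m = len(scenes), len(captions)
    # score[i][j] = best score aligning scenes[:i] with captions[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + overlap(scenes[i - 1], captions[j - 1]),
                score[i - 1][j] + gap,   # skip a scene
                score[i][j - 1] + gap,   # skip a caption block
            )
    # Backtrace to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + overlap(scenes[i - 1], captions[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

scenes = ["INT kitchen Alice cooks", "EXT street Bob runs"]
caps = ["Alice cooks dinner in the kitchen", "Bob runs down the street"]
print(align(scenes, caps))  # → [(0, 0), (1, 1)]
```

A real system would replace word overlap with a learned similarity and, as the paper does, allow limited jumps to handle reordered shots rather than enforcing strict monotonicity.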
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding. Despite the notable progress that has been witnessed in the realm of video understanding, most prior works fail to present tasks and models that address holistic video understanding and the innate visual narrative structures in long-form videos. To tackle this quandary, we put forth the MoviePuzzle task, which amplifies the temporal feature learning and structure learning of video models by reshuffling the shot, frame, and clip layers of movie segments in the presence of video-dialogue information. We start by establishing a carefully refined dataset based on MovieNet by dissecting movies into hierarchical layers and randomly permuting the orders. Besides benchmarking MoviePuzzle with prior art on movie understanding, we devise a Hierarchical Contrastive Movie Clustering (HCMC) model that considers the underlying structure and visual semantic orders for movie reordering. Specifically, through a pairwise and contrastive learning approach, we train models to predict the correct order of each layer. This equips them with the knack for deciphering the visual narrative structure of movies and handling the disorder lurking in video data. Experiments show that our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark, underscoring its efficacy.
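The reordering step behind such a pairwise approach can be sketched in miniature: given any predictor that scores whether clip a precedes clip b, a global order can be recovered by ranking clips by their total "precedes" score. The stand-in predictor below simply compares a hidden timestamp; HCMC's learned contrastive predictor over visual features is not reproduced here, and all names are illustrative.

```python
# Toy sketch of recovering a global order from pairwise "comes-before"
# predictions, as a stand-in for a learned contrastive order predictor.

def recover_order(clips, before_prob):
    """Rank clips by how strongly they are predicted to precede the rest."""
    def wins(c):
        return sum(before_prob(c, other) for other in clips if other is not c)
    return sorted(clips, key=wins, reverse=True)

# Stand-in predictor: pretend each clip carries a hidden timestamp the
# model has learned to compare (a real model would score visual features).
def toy_before_prob(a, b):
    return 1.0 if a["t"] < b["t"] else 0.0

shuffled = [{"id": "c2", "t": 2}, {"id": "c0", "t": 0}, {"id": "c1", "t": 1}]
print([c["id"] for c in recover_order(shuffled, toy_before_prob)])
# → ['c0', 'c1', 'c2']
```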
Non-disruptive use of light fields in image and video processing
In the age of computational imaging, cameras capture not only an image but also data. This additional captured data can best be used for photo-realistic renderings, facilitating numerous post-processing possibilities such as perspective shift, depth scaling, digital refocus, 3D reconstruction, and much more. In computational photography, light field imaging technology captures the complete volumetric information of a scene. This technology has the highest potential to accelerate immersive experiences towards close-to-reality, and it has gained significance in both commercial and research domains. However, due to the lack of coding and storage formats, as well as the incompatibility of the tools to process and enable the data, light fields are not exploited to their full potential. This dissertation approaches the integration of light field data into image and video processing. Towards this goal, it addresses the representation of light fields using advanced file formats designed for 2D image assemblies, to facilitate asset re-usability and interoperability between applications and devices. The novel 5D light field acquisition and the on-going research on coding frameworks are presented. Multiple techniques for optimised sequencing of light field data are also proposed. As light fields contain the complete 3D information of a scene, large amounts of highly redundant data are captured. Hence, by pre-processing the data using the proposed approaches, excellent coding performance can be achieved.
Visual Recognition for Dynamic Scenes
Recognition memory was investigated for naturalistic dynamic scenes. Although visual recognition for static objects and scenes has been investigated previously and found to be extremely robust in terms of fidelity and retention, visual recognition for dynamic scenes has received much less attention. In four experiments, participants viewed a number of clips from novel films and were then tasked to complete a recognition test containing frames from the previously viewed films and difficult foil frames. Recognition performance was good when foils were taken from other parts of the same film (Experiment 1), but degraded greatly when foils were taken from unseen gaps within the viewed footage (Experiments 3 and 4). Removing all non-target frames had a serious effect on recognition performance (Experiment 2). Across all experiments, presenting the films as a random series of clips seemed to have no effect on recognition performance. Patterns of accuracy and response latency in Experiments 3 and 4 appear to be the result of a serial-search process. It is concluded that visual representations of dynamic scenes may be stored as units of events, and participants' old/new judgments of individual frames were better characterized by a cued-recall paradigm than by traditional recognition judgments.
Event structures in knowledge, pictures and text
This thesis proposes new techniques for mining scripts.
Scripts are essential pieces of common sense knowledge that contain information about everyday scenarios (like going to a restaurant), namely the events that usually happen in a scenario (entering, sitting down, reading the menu...), their typical order (ordering happens before eating), and the participants of these events (customer, waiter, food...).
Because many conventionalized scenarios are shared common sense knowledge and thus are usually not described in standard texts, we propose to elicit sequential descriptions of typical scenario instances via crowdsourcing over the internet. This approach overcomes the implicitness problem and, at the same time, is scalable to large data collections.
To generalize over the input data, we need to mine event and participant paraphrases from the textual sequences. For this task we make use of the structural commonalities in the collected sequential descriptions, which yields much more accurate paraphrases than approaches that do not take structural constraints into account.
We further apply the algorithm we developed for event paraphrasing to parallel standard texts for extracting sentential paraphrases and paraphrase fragments. In this case we consider the discourse structure in a text as a sequential event structure. As for event paraphrasing, the structure-aware paraphrasing approach clearly outperforms systems that do not consider discourse structure.
As a multimodal application, we develop a new resource in which textual event descriptions are grounded in videos, which enables new investigations into action description semantics and a more accurate modeling of event description similarities. This grounding approach also opens up new possibilities for applying the computed script knowledge to automated event recognition in videos.
Film policy and the emergence of the cross-cultural: exploring crossover cinema in Flanders (Belgium)
With several films taking on a cross-cultural character, a certain "crossover trend" may be observed within the recent upswing of Flemish cinema (a subdivision of Belgian cinema). This trend is characterized by two major strands: first, migrant and diasporic filmmakers finally seem to be emerging, and second, several filmmakers tend to cross the globe to make their films, thereby minimizing links with Flemish indigenous culture. While paying special attention to the crucial role of film policy in this context, this contribution further investigates the crossover trend by focusing on Turquaze (2010, Kadir Balci) and Altiplano (2009, Peter Brosens & Jessica Woodworth).
Automatic movie analysis and summarisation
Automatic movie analysis is the task of applying Machine Learning methods to screenplays, movie scripts, and motion pictures to facilitate or enable various tasks throughout the entirety of a movie's life-cycle. From helping with making informed decisions about a new movie script with respect to aspects such as its originality, similarity to other movies, or even commercial viability, all the way to offering consumers new and interesting ways of viewing the final movie, many stages in the life-cycle of a movie stand to benefit from Machine Learning techniques that promise to reduce human effort, time, or both. Within this field of automatic movie analysis, this thesis addresses the task of summarising the content of screenplays, enabling users at any stage to gain a broad understanding of a movie from greatly reduced data. The contributions of this thesis are four-fold: (i) We introduce ScriptBase, a new large-scale data set of original movie scripts, annotated with additional meta-information such as genre and plot tags, cast information, and log- and tag-lines. To our knowledge, ScriptBase is the largest data set of its kind, containing scripts and information for almost 1,000 Hollywood movies. (ii) We present a dynamic summarisation model for the screenplay domain, which allows for the extraction of highly informative and important scenes from movie scripts. The extracted summaries allow the content of the original script to stay largely intact and provide the user with its important parts, while greatly reducing the script-reading time. (iii) We extend our summarisation model to capture additional modalities beyond the screenplay text. The model is rendered multi-modal by introducing visual information obtained from the actual movie and by extracting scenes from the movie, allowing users to generate visual summaries of motion pictures. (iv) We devise a novel end-to-end neural network model for generating natural language screenplay overviews. This model enables the user to generate short, descriptive and informative texts that capture certain aspects of a movie script, such as its genres, approximate content, or style, allowing them to gain a fast, high-level understanding of the screenplay. Multiple automatic and human evaluations were carried out to assess the performance of our models, demonstrating that they are well-suited for the tasks set out in this thesis, outperforming strong baselines. Furthermore, the ScriptBase data set has started to gain traction, and is currently used by a number of other researchers in the field to tackle various tasks relating to screenplays and their analysis.
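The extractive core of such a summariser can be illustrated very simply: score each scene by some informativeness proxy, keep the top-k scenes, and return them in their original script order. The scoring criterion below (counting main-cast mentions) and all data are made-up stand-ins for the thesis's actual model, shown only to make the extract-then-reorder pattern concrete.

```python
# Illustrative sketch of extractive screenplay summarisation: score
# scenes, keep the top-k, and restore their original script order.

def summarise(scenes, main_cast, k=2):
    """Pick the k scenes mentioning the most main-cast names, in order."""
    def score(scene):
        return sum(name in scene["text"] for name in main_cast)
    top = sorted(range(len(scenes)), key=lambda i: score(scenes[i]),
                 reverse=True)[:k]
    return [scenes[i] for i in sorted(top)]  # restore script order

scenes = [
    {"id": 1, "text": "ALICE argues with BOB at the dock."},
    {"id": 2, "text": "A stranger walks past."},
    {"id": 3, "text": "ALICE, BOB and CAROL plan the heist."},
]
print([s["id"] for s in summarise(scenes, ["ALICE", "BOB", "CAROL"])])
# → [1, 3]
```

Preserving the original order of the selected scenes is what lets the summary keep the script's content "largely intact", as the abstract describes.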
What Makes A Youth-Produced Film Good? The Youth Audience Perspective
In this article, we explore how youth audiences evaluate the quality of youth-produced films. Our interest stems from a dearth of ways to measure the quality of what youth produce in artistic production processes. As a result, making art in formal learning settings devolves into either romanticized creativity or instrumental work to improve skills in core content areas. We conducted focus groups with 38 youth participants where they viewed four different films produced by the same youth media arts organization that works with young people to produce short-form, autobiographical documentaries. We found that youth focused their evaluations on identifying the films' genre and content and on assessing how well the filmmakers' creative decisions fit with identifications of genre and content. Evaluations were mediated by audiences' expectations and seemed to inform judgments of quality and creativity. We hope that our work can inform the design of formal learning spaces where young people are producing narrative art
Translating Video Content to Natural Language Descriptions
Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including, e.g., object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem, using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset, which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches motivated by prior work. Our translation approach also shows improvements over related work on an image description task.
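The two-stage pipeline the abstract describes can be reduced to a minimal sketch: a predicted semantic representation is "translated" into a sentence. Here a fixed (activity, tool, object) tuple stands in for the CRF's output, and a trivial template stands in for statistical machine translation over a parallel corpus; every name below is an illustrative assumption, not the paper's system.

```python
# Toy sketch of the two-stage idea: semantic representation -> sentence.
# A hand-written template replaces the paper's learned SMT component.

def realise(semrep):
    """Map an (activity, tool, object) tuple to an English sentence."""
    activity, tool, obj = semrep
    return f"The person {activity} the {obj} with a {tool}."

print(realise(("cuts", "knife", "cucumber")))
# → The person cuts the cucumber with a knife.
```

The point of the paper's SMT formulation is precisely to avoid such rigid templates: with a parallel corpus, the mapping from semantic tuples to fluent sentences is learned rather than hand-written.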