1,009 research outputs found

    Video Storytelling: Textual Summaries for Events

    Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this work, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address these challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a Residual Bidirectional Recurrent Neural Network to leverage contextual information from both past and future. Second, we propose a Narrator model to discover the underlying storyline. The Narrator is formulated as a reinforcement learning agent trained by directly optimizing the textual metric of the generated story. We evaluate our method on the Video Story dataset, a new dataset we collected to enable this study. We compare our method with multiple state-of-the-art baselines and show that it achieves better performance in terms of both quantitative measures and a user study. Comment: Published in IEEE Transactions on Multimedia.
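The context-aware embedding idea above can be caricatured in a few lines: each position's output combines a forward pass over past inputs, a backward pass over future inputs, and a residual connection to the input itself. This is a minimal pure-Python sketch, not the authors' network; the scalar "features" and the mixing weight `alpha` are invented for illustration.

```python
def residual_birnn(xs, alpha=0.5):
    """Toy residual bidirectional recurrence over scalar features.

    The forward state accumulates past context, the backward state future
    context; each output is the input plus a mix of both (residual link).
    """
    n = len(xs)
    fwd = [0.0] * n
    bwd = [0.0] * n
    h = 0.0
    for i in range(n):                  # forward pass: past context
        h = alpha * h + (1 - alpha) * xs[i]
        fwd[i] = h
    h = 0.0
    for i in reversed(range(n)):        # backward pass: future context
        h = alpha * h + (1 - alpha) * xs[i]
        bwd[i] = h
    # residual combination: input plus averaged bidirectional context
    return [x + 0.5 * (f + b) for x, f, b in zip(xs, fwd, bwd)]

out = residual_birnn([1.0, 0.0, 0.0, 1.0])
```

Note how `out[0]` already reflects the segment at the far end of the sequence, which is the point of using both directions.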

    Hierarchical Photo-Scene Encoder for Album Storytelling

    In this paper, we propose a novel model with a hierarchical photo-scene encoder and a reconstructor for the task of album storytelling. The photo-scene encoder contains two sub-encoders, namely the photo and scene encoders, which are stacked together and behave hierarchically to fully exploit the structural information of the photos within an album. Specifically, the photo encoder generates a semantic representation for each photo while exploiting temporal relationships among them. The scene encoder, relying on the obtained photo representations, is responsible for detecting scene changes and generating scene representations. Subsequently, the decoder dynamically and attentively summarizes the encoded photo and scene representations to generate a sequence of album representations, based on which a story consisting of multiple coherent sentences is generated. To fully extract the useful semantic information from an album, a reconstructor is employed to reproduce the summarized album representations from the hidden states of the decoder. The proposed model can be trained in an end-to-end manner and improves over state-of-the-art methods on the public visual storytelling (VIST) dataset. Ablation studies further demonstrate the effectiveness of the proposed hierarchical photo-scene encoder and reconstructor. Comment: 8 pages, 4 figures.
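As an illustration of what the scene encoder is responsible for, the sketch below segments a sequence of photo representations into scenes wherever similarity to the previous photo drops, and summarizes each scene by its mean vector. This is a hedged toy version, not the paper's learned encoder; the `threshold` value and the 2-D "photo representations" are invented, and the vectors are assumed nonzero.

```python
def cosine(u, v):
    # cosine similarity of two nonzero vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def segment_scenes(photo_reps, threshold=0.7):
    """Group consecutive photo representations into scenes.

    A new scene starts whenever similarity to the previous photo falls
    below the threshold; each scene is summarized by its mean vector.
    """
    scenes, current = [], [photo_reps[0]]
    for prev, cur in zip(photo_reps, photo_reps[1:]):
        if cosine(prev, cur) < threshold:
            scenes.append(current)
            current = []
        current.append(cur)
    scenes.append(current)
    dim = len(photo_reps[0])
    return [[sum(p[d] for p in s) / len(s) for d in range(dim)] for s in scenes]

album = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scenes = segment_scenes(album)
```

Here the album splits into two scenes at the abrupt change between the second and third photos.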

    ์ด์•ผ๊ธฐํ˜• ์„ค๋ช…๋ฌธ์„ ํ™œ์šฉํ•œ ๋Œ€๊ทœ๋ชจ ๋น„๋””์˜ค ํ•™์Šต ์—ฐ๊ตฌ

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ๊น€๊ฑดํฌ.Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a lot of interests in computer vision research, including image/video captioning, video retrieval and video question answering. It can be applied to high-level computer vision tasks and various future industries such as search engines, social marketing, automated driving, and robotics support through QA / dialog generation for the surrounding environment. However, despite these developments, video-language learning suffers from a higher degree of complexity. This thesis investigates methodologies for learning the relationship between videos and free-formed languages, including explanations, conversations, and question-and-answers, so that the machine can easily adapt to target downstream tasks. First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce the approaches for supervising human attention transfer for the video attention model, which shows the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce the end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using the representative visual concept word detected by the attention-based detector. As a follow-up study on previous methods, we introduce a JSFusion (Joint Sequence Fusion) method that enables efficient video search and QA by enabling many-to-many matching of attention model. Next, we introduce the CiSIN(Character in Story Identification Network), which uses Attention to increase the performance of character grounding and character re-identification in the movie. 
Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos. In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in video, as well as the advantages of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to other video-language tasks, they are expected to be applied to the latest models and bring additional performance improvements. Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating unlimited amounts of video, audio, and free-form language data from the web.
์š”์•ฝํ•˜์ž๋ฉด, ์ด ํ•™์œ„ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•œ ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•๋ก ๋“ค์€ ๋น„๋””์˜ค-์–ธ์–ด ํ•™์Šต์— ํ•ด๋‹นํ•˜๋Š” ๋น„๋””์˜ค ์บก์…˜(Video captioning), ๋น„๋””์˜ค ๊ฒ€์ƒ‰(Video Retrieval), ์‹œ๊ฐ ์งˆ์˜์‘๋‹ต(Video Question and Answering)๋“ฑ์„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ์ˆ ์  ๋””๋”ค๋Œ์ด ๋˜๋ฉฐ, ๋น„๋””์˜ค ์บก์…˜ ํ•™์Šต์„ ํ†ตํ•ด ํ•™์Šต๋œ ์ฃผ์˜ ์ธ์‹ ๋ชจ๋“ˆ์€ ๊ฒ€์ƒ‰ ๋ฐ ์งˆ์˜์‘๋‹ต, ์ธ๋ฌผ ๊ฒ€์ƒ‰ ๋“ฑ ๊ฐ ๋„คํŠธ์›Œํฌ์— ์ด์‹๋˜๋ฉด์„œ ์ƒˆ๋กœ์šด ๋ฌธ์ œ๋“ค์— ๋Œ€ํ•ด ๋™์‹œ์— ์ตœ๊ณ  ์ˆ˜์ค€(State-of-the-art)์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋น„๋””์˜ค-์–ธ์–ด ํ•™์Šต์œผ๋กœ ์–ป์€ ์–ธ์–ด ์ง€์‹์˜ ์ด์ „์€ ์‹œ๊ฐ-์ฒญ๊ฐ์„ ์•„์šฐ๋ฅด๋Š” ๋น„๋””์˜ค ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ•™์Šต์— ํฐ ๋„์›€์ด ๋˜๋Š” ๊ฒƒ์„ ์‹คํ—˜์ ์œผ๋กœ ๋ณด์—ฌ์ค€๋‹ค. ํ–ฅํ›„ ์ž‘์—… ๋ฐฉํ–ฅ (Future Work)์œผ๋กœ๋Š” ์•ž์„œ ์—ฐ๊ตฌํ•œ ๋‚ด์šฉ๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์›น ์†์— ์กด์žฌํ•˜๋Š” ๋Œ€๊ทœ๋ชจ์˜ ์–ธ์–ด, ๋น„๋””์˜ค, ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ฉํ•ด ํ•™์Šต์— ํ™œ์šฉํ•˜์—ฌ ์‚ฐ์—…๊ณ„์˜ ๋งŽ์€ ๋‚œ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ๋น„์ง€๋„ ํ•™์Šต ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ ์ž ํ•œ๋‹ค.Chapter 1 Introduction 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 1.2 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . .8 Chapter 2 Related Work 2.1 Video Captioning . . . . . . . . . . . . . . . . . . . . . . . . . . .9 2.2 Video Retrieval with Natural Language . . . . . . . . . . . . . . 12 2.3 Video Question and Answering . . . . . . . . . . . . . . . . . . . 13 2.4 Cross-modal Representation Learning for Vision and LanguageTasks . . . . 15 Chapter 3 Human Attention Transfer for Video Captioning18 3.1 Introduction 3.2 Video Datasets for Caption and Gaze . . . . . . . . . . . . . . . 21 3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.1 Video Pre-processing and Description . . . . . . . . . . . 22 3.3.2The Recurrent Gaze Prediction (RGP) Model . . . . . . . 23 3.3.3Construction of Visual Feature Pools . . . . . . . . . . . . 
24 3.3.4The Decoder for Caption Generation . . . . . . . . . . . . 26 3.3.5Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.4.1Evaluation of Gaze Prediction . . . . . . . . . . . . . . . . 29 3.4.2Evaluation of Video Captioning . . . . . . . . . . . . . . . 32 3.4.3Human Evaluation via AMT . . . . . . . . . . . . . . . . 35 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Chapter 4 Semantic Word Attention for Video QA and VideoCaptioning 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1.1Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.2Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2.1Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2.2An Attention Model for Concept Detection . . . . . . . . 42 4.2.3Video-to-Language Models . . . . . . . . . . . . . . . . . 45 4.2.4A Model for Description . . . . . . . . . . . . . . . . . . . 45 4.2.5A Model for Fill-in-the-Blank . . . . . . . . . . . . . . . . 48 4.2.6A Model for Multiple-Choice Test . . . . . . . . . . . . . 50 4.2.7A Model for Retrieval . . . . . . . . . . . . . . . . . . . . 51 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.1The LSMDC Dataset and Tasks . . . . . . . . . . . . . . 52 4.3.2Quantitative Results . . . . . . . . . . . . . . . . . . . . . 54 4.3.3Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 56 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Chapter 5 Joint Sequnece Fusion Attention for Multimodal Sequence Data 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.3 Approach . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . 63 5.3.1Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3.2The Joint Semantic Tensor . . . . . . . . . . . . . . . . . 65 5.3.3The Convolutional Hierarchical Decoder . . . . . . . . . . 66 5.3.4An Illustrative Example of How the JSFusion Model Works 68 5.3.5Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.3.6Implementation of Video-Language Models . . . . . . . . 69 5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.1LSMDC Dataset and Tasks . . . . . . . . . . . . . . . . . 71 5.4.2MSR-VTT-(RET/MC) Dataset and Tasks . . . . . . . . . 73 5.4.3Quantitative Results . . . . . . . . . . . . . . . . . . . . . 74 5.4.4Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 76 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Chapter 6 Character Re-Identification and Character Ground-ing for Movie Understanding 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.3.1Video Preprocessing . . . . . . . . . . . . . . . . . . . . . 84 6.3.2Visual Track Embedding . . . . . . . . . . . . . . . . . . . 85 6.3.3Textual Character Embedding . . . . . . . . . . . . . . . 86 6.3.4Character Grounding . . . . . . . . . . . . . . . . . . . . 87 6.3.5Re-Identification . . . . . . . . . . . . . . . . . . . . . . . 88 6.3.6Joint Training . . . . . . . . . . . . . . . . . . . . . . . . 90 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.4.1Experimental Setup . . . . . . . . . . . . . . . . . . . . . 92 6.4.2Quantitative Results . . . . . . . . . . . . . . . . . . . . . 93 6.4.3Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 95 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
97 Chapter 7 Transitional Adaptation of Pretrained Models forVisual Storytelling 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.3.1The Visual Encoder . . . . . . . . . . . . . . . . . . . . . 104 7.3.2The Language Generator . . . . . . . . . . . . . . . . . . 104 7.3.3Adaptation training . . . . . . . . . . . . . . . . . . . . . 105 7.3.4The Sequential Coherence Loss . . . . . . . . . . . . . . . 105 7.3.5Training with the adaptation Loss . . . . . . . . . . . . . 107 7.3.6Fine-tuning and Inference . . . . . . . . . . . . . . . . . . 107 7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7.4.1Experimental Setup . . . . . . . . . . . . . . . . . . . . . 109 7.4.2Quantitative Results . . . . . . . . . . . . . . . . . . . . . 112 7.4.3Further Analyses . . . . . . . . . . . . . . . . . . . . . . . 112 7.4.4Human Evaluation Results . . . . . . . . . . . . . . . . . 115 7.4.5Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 116 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Chapter 8 Conclusion 8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 8.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Bibliography ... 123 ์š”์•ฝ ... 148 Acknowledgements ... 150Docto
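A recurring idea in the thesis above is many-to-many matching between video frames and words, the basis of JSFusion. As a hedged toy version (not the actual Joint Semantic Tensor), one can compute pairwise similarities between frame and word embeddings, then aggregate by pairing each word with its best-matching frame:

```python
def match_score(frames, words):
    """Toy many-to-many matching between two embedding sequences.

    Builds a word-by-frame matrix of dot-product similarities, takes each
    word's best-matching frame, and averages those maxima into one score.
    """
    sims = [[sum(f_d * w_d for f_d, w_d in zip(f, w)) for f in frames]
            for w in words]
    return sum(max(row) for row in sims) / len(sims)

frames = [[1.0, 0.0], [0.0, 1.0]]          # invented frame embeddings
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # invented word embeddings
score = match_score(frames, words)
```

Because the score is computed for a whole video-sentence pair, the same routine can rank candidate videos for retrieval or candidate answers for multiple-choice QA, which is how the many-to-many formulation serves both tasks.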

    Digital video revisited: Storytelling, conferencing, remixing


    AutoAD: Movie Description in Context

    The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation. In order to obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, as well as the subtitles; (ii) we address the lack of training data by pretraining on large-scale datasets, where visual or contextual information is unavailable, e.g. text-only AD without movies or visual captioning datasets without context; (iii) we improve on the currently available AD datasets, by removing label noise in the MAD dataset, and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods.Comment: CVPR2023 Highlight. Project page: https://www.robots.ox.ac.uk/~vgg/research/autoad
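The "mapping network that bridges the two models" can be pictured as a function from one visual embedding to a short sequence of prefix vectors in the language model's embedding space, in the style of prefix-based captioning. The sketch below uses a random linear map as a stand-in for the trained mapper; all names and dimensions (`vis_dim`, `lm_dim`, `k`) are invented, and no real CLIP or GPT model is involved.

```python
import random

def make_mapper(vis_dim, lm_dim, k, seed=0):
    """Build a fixed random linear map from a visual embedding to k
    prefix vectors of the language model's embedding size (a stand-in
    for the trained mapping network)."""
    rng = random.Random(seed)
    # one (lm_dim x vis_dim) weight matrix per prefix token
    weights = [[[rng.uniform(-0.1, 0.1) for _ in range(vis_dim)]
                for _ in range(lm_dim)] for _ in range(k)]

    def map_prefix(visual_embedding):
        # each prefix vector is a linear projection of the same embedding
        return [[sum(w * x for w, x in zip(row, visual_embedding))
                 for row in token_w] for token_w in weights]
    return map_prefix

mapper = make_mapper(vis_dim=4, lm_dim=8, k=3)
prefix = mapper([0.1, 0.2, 0.3, 0.4])   # k vectors to prepend to the LM input
```

In the paper's setup only this mapper would be trained, while the vision and language models stay frozen; that is what makes the approach cheap relative to end-to-end training.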

    Visual Storytelling: Captioning of Image Sequences

    In the space of automated captioning, visual storytelling is one dimension of the task. Given sequences of images as inputs, visual storytelling (VIST) is about automatically generating textual narratives as outputs. Automatically producing stories for an ordered set of pictures or video frames has several potential applications in diverse domains, ranging from multimedia consumption to autonomous systems. The task has evolved over recent years and is moving into adolescence. The availability of a dedicated VIST dataset has mainstreamed research on visual storytelling and related sub-tasks. This thesis systematically reports the development of standard captioning as a parent task, with accompanying facets like dense captioning, and gradually delves into the domain of visual storytelling. Existing models proposed for VIST are described by examining their respective characteristics and scope. All methods for VIST adapt the typical encoder-decoder design, owing to its success on the standard image captioning task. Several subtle differences in the underlying intentions of these methods are subsequently summarized. Additionally, alternate perspectives on the existing approaches are explored by re-modeling and modifying their learning mechanisms. Experiments with different objective functions are reported with subjective comparisons and relevant results. Finally, the sub-field of character relationships within storytelling is studied, and a novel idea called character-centric storytelling is proposed to account for prospective characters across data modalities.
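The encoder-decoder recipe mentioned above comes down to encoding the image sequence and then emitting words one at a time, each conditioned on the previous word. A toy greedy decoder over a made-up bigram score table illustrates the control flow; the table stands in for a trained language model and is not from any of the works listed here.

```python
def greedy_decode(context_word, bigram_scores, max_len=5, eos="<eos>"):
    """Greedy decoding: at each step pick the highest-scoring next word
    given the previous one, until <eos> or the length limit."""
    out, prev = [], context_word
    for _ in range(max_len):
        candidates = bigram_scores.get(prev, {eos: 1.0})
        prev = max(candidates, key=candidates.get)
        if prev == eos:
            break
        out.append(prev)
    return out

# invented bigram table standing in for a trained decoder
table = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "dog": {"runs": 0.9, "<eos>": 0.1},
    "runs": {"<eos>": 1.0},
}
caption = greedy_decode("<start>", table)   # ["a", "dog", "runs"]
```

The VIST variants surveyed in the thesis mostly differ in what conditions this loop (album context, scene state, character identities) and in the objective used to train it, not in the loop itself.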