791 research outputs found

    Blind channel identification/equalization with applications in wireless communications

    Ph.D. (Doctor of Philosophy)

    Recent Application in Biometrics

    In recent years, a number of recognition and authentication systems based on biometric measurements have been proposed. Algorithms and sensors have been developed to acquire and process many different biometric traits. Moreover, biometric technology is being used in novel ways, with potential commercial and practical implications for our daily activities. The key objective of the book is to provide a collection of comprehensive references on recent theoretical developments as well as novel applications in biometrics. The topics covered in this book reflect both aspects of development well. They include biometric sample quality, privacy-preserving and cancelable biometrics, contactless biometrics, novel and unconventional biometrics, and the technical challenges of implementing the technology in portable devices. The book consists of 15 chapters, divided into four sections: biometric applications on mobile platforms, cancelable biometrics, biometric encryption, and other applications. The book was reviewed by editors Dr. Jucheng Yang and Dr. Norman Poh. We deeply appreciate the efforts of our guest editors, Dr. Girija Chetty, Dr. Loris Nanni, Dr. Jianjiang Feng, Dr. Dongsun Park and Dr. Sook Yoon, as well as a number of anonymous reviewers.

    Recent Advances in Image Restoration with Applications to Real World Problems

    In the past few decades, imaging hardware has improved tremendously in terms of resolution, enabling widespread use of images in many diverse applications on Earth and in planetary missions. However, practical issues associated with image acquisition still affect image quality. Some of these issues, such as blurring, measurement noise, mosaicing artifacts, and low spatial or spectral resolution, can seriously affect the accuracy of those applications. This book intends to provide the reader with a glimpse of the latest developments in image restoration, including image super-resolution, image fusion to enhance spatial, spectral, and temporal resolution, and the generation of synthetic images using deep learning techniques. Some practical applications are also included.

    Proceedings of the 2021 Joint Workshop of Fraunhofer IOSB and Institute for Anthropomatics, Vision and Fusion Laboratory

    In 2021, the annual joint workshop of the Fraunhofer IOSB and KIT IES was hosted at the IOSB in Karlsruhe. For a week, from the 2nd to the 6th of July, the doctoral students presented extensive reports on the status of their research. The results and ideas presented at the workshop are collected in this book in the form of detailed technical reports.

    Multiple-Aspect Analysis of Semantic Trajectories

    This open access book constitutes the refereed post-conference proceedings of the First International Workshop on Multiple-Aspect Analysis of Semantic Trajectories, MASTER 2019, held in conjunction with the 19th European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2019, in Würzburg, Germany, in September 2019. The 8 full papers presented were carefully reviewed and selected from 12 submissions. They represent an interesting mix of techniques to solve recurrent as well as new problems in the semantic trajectory domain, such as data representation models, data management systems, machine learning approaches for anomaly detection, and common pathways identification.

    Topics in Adaptive Optics

    Advances in adaptive optics technology and applications move forward at a rapid pace. The basic idea of wavefront compensation in real time has been around since the mid-1970s. The first widely used application of adaptive optics was compensating for atmospheric turbulence effects in astronomical imaging and laser beam propagation. While some topics have been researched and reported for years, even decades, new applications and advances in the supporting technologies occur almost daily. This book brings together 11 original chapters related to adaptive optics, written by an international group of invited authors. Topics include atmospheric turbulence characterization, astronomy with large telescopes, image post-processing, high-power laser distortion compensation, adaptive optics and the human eye, wavefront sensors, and deformable mirrors.

    Advanced deep learning for medical image segmentation: Towards global and data-efficient learning


    A Study on Large-Scale Video Learning Using Narrative Descriptions

    Dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. Gunhee Kim.

    Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn considerable interest in computer vision research, including image/video captioning, video retrieval, and video question answering. Such learning can be applied to high-level computer vision tasks and to various future industries such as search engines, social marketing, automated driving, and robotics support, through QA and dialog generation about the surrounding environment. However, despite these developments, video-language learning suffers from a higher degree of complexity. This thesis investigates methodologies for learning the relationship between videos and free-formed language, including explanations, conversations, and question-answer pairs, so that the machine can easily adapt to target downstream tasks. First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer for the video attention model, which show that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce the end-to-end semantic attention method, which further reduces the complexity of the visual attention algorithm by using representative visual concept words detected by an attention-based detector. As a follow-up to these methods, we introduce the JSFusion (Joint Sequence Fusion) method, which enables efficient video search and QA through many-to-many matching in the attention model. Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to improve the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos. In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in the video as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applicable to the latest models, bringing additional performance improvements. Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating an unlimited amount of video, audio, and free-formed language data from the web.

    Korean abstract (translated): Vision-language learning is an important and actively studied field because it can be applied not only to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding, and event detection, but also, through question answering and dialogue generation about the surrounding environment, to many future industries: Internet search as well as the recently active areas of social marketing, automated driving, and robotics support. Computer vision and natural language processing have each advanced within their own domains on the strength of this importance; with the recent advent of deep learning they have developed dramatically, complementing each other and improving learning results to produce a strong synergy. Despite these advances, however, video-language learning often runs into difficulty because the problem is considerably more complex.

    This thesis aims to learn more efficiently the relationship between videos and the corresponding descriptions, conversations, question-answer pairs, and, more broadly, free-formed language, and to improve models so that they adapt well to target tasks. First, we introduce several methods for efficiently learning the relationship between long sentences and videos, whose visual complexity is higher than that of images. We introduce a method for supervising a video-language model with a human attention model, followed by a semantic attention method that further reduces the complexity of the attention algorithm by using representative visual words first detected in the video as an intermediary, and a video-language fusion method (Joint Sequence Fusion) that enables efficient video retrieval and question answering based on many-to-many matching in the attention model. Together these are methods for training video attention efficiently. Next, we introduce the Character in Story Identification Network, in which the attention model goes beyond object-word relations and performs person search and person re-identification in video simultaneously, so that the two tasks reinforce each other. Finally, we introduce a method that uses self-supervised learning to guide an attention-based language model toward generating coherent descriptions of long videos.

    In summary, the new methodologies proposed in this dissertation serve as technical stepping stones for solving video-language learning tasks such as video captioning, video retrieval, and video question answering. The attention module learned through video caption training, when transplanted into networks for retrieval, question answering, and person search, achieved state-of-the-art performance on these new problems simultaneously. This shows experimentally that transferring linguistic knowledge gained from video-language learning greatly helps multimodal video learning spanning vision and audio. As future work, building on the research above, we intend to build an unsupervised learning model that can solve many hard problems in industry by integrating the large-scale language, video, and audio data available on the web for training.

    Contents:
    Chapter 1 Introduction
        1.1 Contributions
        1.2 Outline of the thesis
    Chapter 2 Related Work
        2.1 Video Captioning
        2.2 Video Retrieval with Natural Language
        2.3 Video Question and Answering
        2.4 Cross-modal Representation Learning for Vision and Language Tasks
    Chapter 3 Human Attention Transfer for Video Captioning
        3.1 Introduction
        3.2 Video Datasets for Caption and Gaze
        3.3 Approach
            3.3.1 Video Pre-processing and Description
            3.3.2 The Recurrent Gaze Prediction (RGP) Model
            3.3.3 Construction of Visual Feature Pools
            3.3.4 The Decoder for Caption Generation
            3.3.5 Training
        3.4 Experiments
            3.4.1 Evaluation of Gaze Prediction
            3.4.2 Evaluation of Video Captioning
            3.4.3 Human Evaluation via AMT
        3.5 Conclusion
    Chapter 4 Semantic Word Attention for Video QA and Video Captioning
        4.1 Introduction
            4.1.1 Related Work
            4.1.2 Contributions
        4.2 Approach
            4.2.1 Preprocessing
            4.2.2 An Attention Model for Concept Detection
            4.2.3 Video-to-Language Models
            4.2.4 A Model for Description
            4.2.5 A Model for Fill-in-the-Blank
            4.2.6 A Model for Multiple-Choice Test
            4.2.7 A Model for Retrieval
        4.3 Experiments
            4.3.1 The LSMDC Dataset and Tasks
            4.3.2 Quantitative Results
            4.3.3 Qualitative Results
        4.4 Conclusion
    Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
        5.1 Introduction
        5.2 Related Work
        5.3 Approach
            5.3.1 Preprocessing
            5.3.2 The Joint Semantic Tensor
            5.3.3 The Convolutional Hierarchical Decoder
            5.3.4 An Illustrative Example of How the JSFusion Model Works
            5.3.5 Training
            5.3.6 Implementation of Video-Language Models
        5.4 Experiments
            5.4.1 LSMDC Dataset and Tasks
            5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks
            5.4.3 Quantitative Results
            5.4.4 Qualitative Results
        5.5 Conclusion
    Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
        6.1 Introduction
        6.2 Related Work
        6.3 Approach
            6.3.1 Video Preprocessing
            6.3.2 Visual Track Embedding
            6.3.3 Textual Character Embedding
            6.3.4 Character Grounding
            6.3.5 Re-Identification
            6.3.6 Joint Training
        6.4 Experiments
            6.4.1 Experimental Setup
            6.4.2 Quantitative Results
            6.4.3 Qualitative Results
        6.5 Conclusion
    Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
        7.1 Introduction
        7.2 Related Work
        7.3 Approach
            7.3.1 The Visual Encoder
            7.3.2 The Language Generator
            7.3.3 Adaptation Training
            7.3.4 The Sequential Coherence Loss
            7.3.5 Training with the Adaptation Loss
            7.3.6 Fine-tuning and Inference
        7.4 Experiments
            7.4.1 Experimental Setup
            7.4.2 Quantitative Results
            7.4.3 Further Analyses
            7.4.4 Human Evaluation Results
            7.4.5 Qualitative Results
        7.5 Conclusion
    Chapter 8 Conclusion
        8.1 Summary
        8.2 Future Works
    Bibliography
    Abstract (in Korean)
    Acknowledgements