49 research outputs found

    A Transformer based Multi task Model for Attribute based Person Retrieval

    Person retrieval is a crucial task in video surveillance. While searching for persons-of-interest based on so-called query images gains much interest in the research community, attribute-based approaches are rarely studied. Attribute-based person retrieval takes a person's semantic attributes as input and provides a ranked list of search results that match the description. Typically, such approaches either build on a pedestrian attribute recognition approach or learn a joint feature space between attribute descriptions and image data. In this work, both approaches are combined in a multi-task model to benefit from the advantages of both procedures. Moreover, transformer modules are incorporated to increase performance further. Experimental evaluation proves the effectiveness of the approach and shows that the proposed architecture outperforms the baselines significantly.
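    As a rough illustration of the first route mentioned in the abstract (ranking on top of a pedestrian attribute recognition model), the sketch below sorts gallery images by the distance between a binary query attribute vector and per-image attribute probabilities. It is not the paper's multi-task transformer model: the attribute names, image identifiers, scores, and the `rank_gallery` helper are all hypothetical, and Euclidean distance is only one plausible choice of matching score.

```python
import math

def rank_gallery(query, gallery):
    """Rank gallery images by Euclidean distance between the query
    attribute vector and each image's predicted attribute probabilities
    (smaller distance = better match)."""
    def distance(probs):
        return math.sqrt(sum((q - p) ** 2 for q, p in zip(query, probs)))
    return sorted(gallery, key=lambda name: distance(gallery[name]))

# Hypothetical attribute order: [male, backpack, long_sleeves]
query = [1, 1, 0]

# Made-up outputs of an attribute recognition model for three images.
gallery = {
    "img_a": [0.9, 0.8, 0.1],
    "img_b": [0.2, 0.9, 0.7],
    "img_c": [0.8, 0.1, 0.2],
}

ranking = rank_gallery(query, gallery)
print(ranking)  # best match first
```

    The alternative route named in the abstract, a learned joint feature space, would replace the hand-written distance with similarities between learned embeddings of the attribute description and the image.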

    Lidar-based Gait Analysis and Activity Recognition in a 4D Surveillance System

    This paper presents new approaches for gait and activity analysis based on data streams of a Rotating Multi Beam (RMB) Lidar sensor. The proposed algorithms are embedded into an integrated 4D vision and visualization system, which is able to analyze and interactively display real scenarios in natural outdoor environments with walking pedestrians. The main focus of the investigations is gait-based person re-identification during tracking, and recognition of specific activity patterns such as bending, waving, making phone calls and checking the time by looking at a wristwatch. The descriptors for training and recognition are observed and extracted from realistic outdoor surveillance scenarios, where multiple pedestrians are walking in the field of interest following possibly intersecting trajectories, so the observations might often be affected by occlusions or background noise. Since there is no public database available for such scenarios, we created and published a new Lidar-based outdoor gait and activity dataset on our website, which contains point cloud sequences of 28 different persons extracted and aggregated from 35 minutes-long measurements. The presented results confirm that both efficient gait-based identification and activity recognition are achievable in the sparse point clouds of a single RMB Lidar sensor. After extracting the people trajectories, we synthesized a free-viewpoint video, where moving avatar models follow the trajectories of the observed pedestrians in real time, ensuring that the leg movements of the animated avatars are synchronized with the real gait cycles observed in the Lidar stream.

    Proceedings of the 2021 Joint Workshop of Fraunhofer IOSB and Institute for Anthropomatics, Vision and Fusion Laboratory

    In 2021, the annual joint workshop of the Fraunhofer IOSB and KIT IES was hosted at the IOSB in Karlsruhe. For a week, from the 2nd to the 6th of July, the doctoral students presented extensive reports on the status of their research. The results and ideas presented at the workshop are collected in this book in the form of detailed technical reports.

    Robust density modelling using the student's t-distribution for human action recognition

    The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model since it is highly sensitive to outliers. The Gaussian distribution is also often used as base component of graphical models for recognising human actions in the videos (hidden Markov model and others) and the presence of outliers can significantly affect the recognition accuracy. In contrast, the Student's t-distribution is more robust to outliers and can be exploited to improve the recognition rate in the presence of abnormal data. In this paper, we present an HMM which uses mixtures of t-distributions as observation probabilities and show how experiments over two well-known datasets (Weizmann, MuHAVi) reported a remarkable improvement in classification accuracy. © 2011 IEEE
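    The robustness claim can be illustrated with a small one-dimensional sketch. It compares the Gaussian maximum-likelihood location (the sample mean) with a Student's t location estimate obtained by the usual iteratively reweighted update, with the degrees of freedom and scale held fixed. This is not the paper's HMM with t-mixture observation densities: the data values, `nu=3.0`, `scale=1.0`, and the helper name `t_location` are assumptions chosen for illustration.

```python
import statistics

def t_location(xs, nu=3.0, scale=1.0, iters=50):
    """Estimate the location of a Student's t-distribution with fixed
    degrees of freedom `nu` and fixed `scale`, using the standard
    reweighting update: points far from the current estimate receive
    small weights, so outliers contribute little."""
    mu = statistics.median(xs)  # robust starting point
    for _ in range(iters):
        weights = [(nu + 1.0) / (nu + ((x - mu) / scale) ** 2) for x in xs]
        mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return mu

# Made-up feature values: six inliers near 1.0 and one gross outlier.
data = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 20.0]

gauss_mu = statistics.fmean(data)  # Gaussian ML estimate of the mean
t_mu = t_location(data)            # Student's t location estimate

print(f"Gaussian mean: {gauss_mu:.3f}")
print(f"Student's t location: {t_mu:.3f}")
```

    On this toy data the sample mean is pulled far towards the outlier, while the t-based estimate stays close to the bulk of the observations, which is the effect the abstract relies on.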

    A Study on Large-Scale Video Learning Using Narrative Descriptions (이야기형 설명문을 활용한 대규모 비디오 학습 연구)

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. Gunhee Kim (김건희).

    Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn considerable interest in computer vision research, including image/video captioning, video retrieval and video question answering. Such methods can be applied to high-level computer vision tasks and to various future industries such as search engines, social marketing, automated driving, and robotics support through QA / dialog generation for the surrounding environment. However, despite these developments, video-language learning suffers from a higher degree of complexity. This thesis investigates methodologies for learning the relationship between videos and free-formed languages, including explanations, conversations, and question-and-answers, so that the machine can easily adapt to target downstream tasks. First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer for the video attention model, which show that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce the end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using the representative visual concept word detected by the attention-based detector. As a follow-up study on previous methods, we introduce the JSFusion (Joint Sequence Fusion) method, which enables efficient video search and QA through many-to-many matching in the attention model. Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to increase the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos. In summary, this thesis presents novel approaches for automatic video description generation/retrieval and shows the benefits of extracting linguistic knowledge for object and motion in the video as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applicable to the latest models, bringing additional performance improvements. Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in the industry by integrating an unlimited amount of video, audio, and free-formed language data from the web.

    Korean abstract (translated): Vision-language learning is an important and actively studied field because it applies not only to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding, and event detection, but also, through question answering and dialogue generation about the surrounding environment, to various future industries, including internet search, the recently active field of social marketing, automated driving, and robotics assistance. Computer vision and natural language processing have each advanced steadily in their own domains, but with the recent advent of deep learning they have progressed remarkably, complementing each other, improving learning results, and producing a large synergy. Despite these advances, video-language learning often runs into difficulties because the complexity of the problem is considerably higher.

    This thesis aims to learn more efficiently the relationship between videos and the corresponding descriptions, dialogues, question-answers and, more broadly, free-formed language, and to improve adaptation to target tasks. First, we introduce several methods for efficiently learning the relationship between long sentences and videos, whose visual complexity is higher than that of images. We introduce a method for supervising a video-language model with a human attention model; a semantic attention method that further reduces the complexity of the attention algorithm by using representative visual words first detected in the video as an intermediary; and a video-language fusion method (Joint Sequence Fusion) that enables efficient video retrieval and question answering based on many-to-many matching in the attention model. Together these are methods for training video attention efficiently. Next, we introduce the Character in Story Identification Network, in which the attention model goes beyond object-word relations to perform person search and person re-identification in video simultaneously, with each task reinforcing the other. Finally, we introduce a method that uses self-supervised learning to guide an attention-based language model to generate coherent descriptions of long videos.

    In summary, the new methodologies proposed in this thesis serve as technical stepping stones for solving video-language learning tasks such as video captioning, video retrieval, and video question answering, and the attention modules learned through video captioning, when transplanted into networks for retrieval, question answering, and person search, achieved state-of-the-art performance on these new problems. This shows experimentally that transferring the linguistic knowledge obtained from video-language learning greatly helps multimodal video learning that spans vision and audio. As future work, building on the research above, we aim to create an unsupervised learning model that can solve many industrial challenges by integrating large-scale language, video, and audio data from the web for training.

    Table of contents:

    Chapter 1 Introduction
        1.1 Contributions
        1.2 Outline of the thesis
    Chapter 2 Related Work
        2.1 Video Captioning
        2.2 Video Retrieval with Natural Language
        2.3 Video Question and Answering
        2.4 Cross-modal Representation Learning for Vision and Language Tasks
    Chapter 3 Human Attention Transfer for Video Captioning
        3.1 Introduction
        3.2 Video Datasets for Caption and Gaze
        3.3 Approach
            3.3.1 Video Pre-processing and Description
            3.3.2 The Recurrent Gaze Prediction (RGP) Model
            3.3.3 Construction of Visual Feature Pools
            3.3.4 The Decoder for Caption Generation
            3.3.5 Training
        3.4 Experiments
            3.4.1 Evaluation of Gaze Prediction
            3.4.2 Evaluation of Video Captioning
            3.4.3 Human Evaluation via AMT
        3.5 Conclusion
    Chapter 4 Semantic Word Attention for Video QA and Video Captioning
        4.1 Introduction
            4.1.1 Related Work
            4.1.2 Contributions
        4.2 Approach
            4.2.1 Preprocessing
            4.2.2 An Attention Model for Concept Detection
            4.2.3 Video-to-Language Models
            4.2.4 A Model for Description
            4.2.5 A Model for Fill-in-the-Blank
            4.2.6 A Model for Multiple-Choice Test
            4.2.7 A Model for Retrieval
        4.3 Experiments
            4.3.1 The LSMDC Dataset and Tasks
            4.3.2 Quantitative Results
            4.3.3 Qualitative Results
        4.4 Conclusion
    Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
        5.1 Introduction
        5.2 Related Work
        5.3 Approach
            5.3.1 Preprocessing
            5.3.2 The Joint Semantic Tensor
            5.3.3 The Convolutional Hierarchical Decoder
            5.3.4 An Illustrative Example of How the JSFusion Model Works
            5.3.5 Training
            5.3.6 Implementation of Video-Language Models
        5.4 Experiments
            5.4.1 LSMDC Dataset and Tasks
            5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks
            5.4.3 Quantitative Results
            5.4.4 Qualitative Results
        5.5 Conclusion
    Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
        6.1 Introduction
        6.2 Related Work
        6.3 Approach
            6.3.1 Video Preprocessing
            6.3.2 Visual Track Embedding
            6.3.3 Textual Character Embedding
            6.3.4 Character Grounding
            6.3.5 Re-Identification
            6.3.6 Joint Training
        6.4 Experiments
            6.4.1 Experimental Setup
            6.4.2 Quantitative Results
            6.4.3 Qualitative Results
        6.5 Conclusion
    Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
        7.1 Introduction
        7.2 Related Work
        7.3 Approach
            7.3.1 The Visual Encoder
            7.3.2 The Language Generator
            7.3.3 Adaptation training
            7.3.4 The Sequential Coherence Loss
            7.3.5 Training with the adaptation Loss
            7.3.6 Fine-tuning and Inference
        7.4 Experiments
            7.4.1 Experimental Setup
            7.4.2 Quantitative Results
            7.4.3 Further Analyses
            7.4.4 Human Evaluation Results
            7.4.5 Qualitative Results
        7.5 Conclusion
    Chapter 8 Conclusion
        8.1 Summary
        8.2 Future Works
    Bibliography
    Korean abstract (요약)
    Acknowledgements
    Docto

    Open Platforms for Connected Vehicles

    The abstract is in the attachment.