461,004 research outputs found

    Learning Feature Selection and Combination Strategies for Generic Salient Object Detection

    No full text
    For a diverse range of applications in machine vision from social media searches to robotic home care providers, it is important to replicate the mechanism by which the human brain selects the most important visual information, while suppressing the remaining non-usable information. Many computational methods attempt to model this process by following the traditional model of visual attention. The traditional model of attention involves feature extraction, conditioning and combination to capture this behaviour of human visual attention. Consequently, the model has inherent design choices at its various stages. These choices include selection of parameters related to the feature computation process, setting a conditioning approach, feature importance and setting a combination approach. Despite rapid research and substantial improvements in benchmark performance, the performance of many models depends upon tuning these design choices in an ad hoc fashion. Additionally, these design choices are heuristic in nature, thus resulting in good performance only in certain settings. Consequentially, many such models exhibit low robustness to difficult stimuli and the complexities of real-world imagery. Machine learning and optimisation technique have long been used to increase the generalisability of a system to unseen data. Surprisingly, artificial learning techniques have not been investigated to their full potential to improve generalisation of visual attention methods. The proposed thesis is that artificial learning can increase the generalisability of the traditional model of visual attention by effective selection and optimal combination of features. The following new techniques have been introduced at various stages of the traditional model of visual attention to improve its generalisation performance, specifically on challenging cases of saliency detection: 1. Joint optimisation of feature related parameters and feature importance weights is introduced for the first time to improve the generalisation of the traditional model of visual attention. To evaluate the joint learning hypothesis, a new method namely GAOVSM is introduced for the tasks of eye fixation prediction. By finding the relationships between feature related parameters and feature importance, the developed method improves the generalisation performance of baseline method (that employ human encoded parameters). 2. Spectral matting based figure-ground segregation is introduced to overcome the artifacts encountered by region-based salient object detection approaches. By suppressing the unwanted background information and assigning saliency to object parts in a uniform manner, the developed FGS approach overcomes the limitations of region based approaches. 3. Joint optimisation of feature computation parameters and feature importance weights is introduced for optimal combination of FGS with complementary features for the first time for salient object detection. By learning feature related parameters and their respective importance at multiple segmentation thresholds and by considering the performance gaps amongst features, the developed FGSopt method improves the object detection performance of the FGS technique also improving upon several state-of-the-art salient object detection models. 4. The introduction of multiple combination schemes/rules further extends the generalisability of the traditional attention model beyond that of joint optimisation based single rules. The introduction of feature composition based grouping of images, enables the developed IGA method to autonomously identify an appropriate combination strategy for an unseen image. The results of a pair-wise ranksum test confirm that the IGA method is significantly better than the deterministic and classification based benchmark methods on the 99% confidence interval level. Extending this line of research, a novel relative encoding approach enables the adapted XCSCA method to group images having similar saliency prediction ability. By keeping track of previous inputs, the introduced action part of the XCSCA approach enables learning of generalised feature importance rules. By more accurate grouping of images as compared with IGA, generalised learnt rules and appropriate application of feature importance rules, the XCSCA approach improves upon the generalisation performance of the IGA method. 5. The introduced uniform saliency assignment and segmentation quality cues enable label free evaluation of a feature/saliency map. By accurate ranking and effective clustering, the developed DFS method successfully solves the complex problem of finding appropriate features for combination (on an-image-by-image basis) for the first time in saliency detection. The DFS method enables ground truth free evaluation of saliency methods and advances the state-of-the-art in data driven saliency aggregation by detection and deselection of redundant information. The final contribution is that the developed methods are formed into a complete system where analysis shows the effects of their interactions on the system. Based on the saliency prediction accuracy versus computational time trade-off, specialised variants of the proposed methods are presented along with the recommendations for further use by other saliency detection systems. This research work has shown that artificial learning can increase the generalisation of the traditional model of attention by effective selection and optimal combination of features. Overall, this thesis has shown that it is the ability to autonomously segregate images based on their types and subsequent learning of appropriate combinations that aid generalisation on difficult unseen stimuli

    Human behavior understanding for worker-centered intelligent manufacturing

    Get PDF
    β€œIn a worker-centered intelligent manufacturing system, sensing and understanding of the worker’s behavior are the primary tasks, which are essential for automatic performance evaluation & optimization, intelligent training & assistance, and human-robot collaboration. In this study, a worker-centered training & assistant system is proposed for intelligent manufacturing, which is featured with self-awareness and active-guidance. To understand the hand behavior, a method is proposed for complex hand gesture recognition using Convolutional Neural Networks (CNN) with multiview augmentation and inference fusion, from depth images captured by Microsoft Kinect. To sense and understand the worker in a more comprehensive way, a multi-modal approach is proposed for worker activity recognition using Inertial Measurement Unit (IMU) signals obtained from a Myo armband and videos from a visual camera. To automatically learn the importance of different sensors, a novel attention-based approach is proposed to human activity recognition using multiple IMU sensors worn at different body locations. To deploy the developed algorithms to the factory floor, a real-time assembly operation recognition system is proposed with fog computing and transfer learning. The proposed worker-centered training & assistant system has been validated and demonstrated the feasibility and great potential for applying to the manufacturing industry for frontline workers. Our developed approaches have been evaluated: 1) the multi-view approach outperforms the state-of-the-arts on two public benchmark datasets, 2) the multi-modal approach achieves an accuracy of 97% on a worker activity dataset including 6 activities and achieves the best performance on a public dataset, 3) the attention-based method outperforms the state-of-the-art methods on five publicly available datasets, and 4) the developed transfer learning model achieves a real-time recognition accuracy of 95% on a dataset including 10 worker operations”--Abstract, page iv

    Multiscale Discriminant Saliency for Visual Attention

    Full text link
    The bottom-up saliency, an early stage of humans' visual attention, can be considered as a binary classification problem between center and surround classes. Discriminant power of features for the classification is measured as mutual information between features and two classes distribution. The estimated discrepancy of two feature classes very much depends on considered scale levels; then, multi-scale structure and discriminant power are integrated by employing discrete wavelet features and Hidden markov tree (HMT). With wavelet coefficients and Hidden Markov Tree parameters, quad-tree like label structures are constructed and utilized in maximum a posterior probability (MAP) of hidden class variables at corresponding dyadic sub-squares. Then, saliency value for each dyadic square at each scale level is computed with discriminant power principle and the MAP. Finally, across multiple scales is integrated the final saliency map by an information maximization rule. Both standard quantitative tools such as NSS, LCC, AUC and qualitative assessments are used for evaluating the proposed multiscale discriminant saliency method (MDIS) against the well-know information-based saliency method AIM on its Bruce Database wity eye-tracking data. Simulation results are presented and analyzed to verify the validity of MDIS as well as point out its disadvantages for further research direction.Comment: 16 pages, ICCSA 2013 - BIOCA sessio

    Attention and Anticipation in Fast Visual-Inertial Navigation

    Get PDF
    We study a Visual-Inertial Navigation (VIN) problem in which a robot needs to estimate its state using an on-board camera and an inertial sensor, without any prior knowledge of the external environment. We consider the case in which the robot can allocate limited resources to VIN, due to tight computational constraints. Therefore, we answer the following question: under limited resources, what are the most relevant visual cues to maximize the performance of visual-inertial navigation? Our approach has four key ingredients. First, it is task-driven, in that the selection of the visual cues is guided by a metric quantifying the VIN performance. Second, it exploits the notion of anticipation, since it uses a simplified model for forward-simulation of robot dynamics, predicting the utility of a set of visual cues over a future time horizon. Third, it is efficient and easy to implement, since it leads to a greedy algorithm for the selection of the most relevant visual cues. Fourth, it provides formal performance guarantees: we leverage submodularity to prove that the greedy selection cannot be far from the optimal (combinatorial) selection. Simulations and real experiments on agile drones show that our approach ensures state-of-the-art VIN performance while maintaining a lean processing time. In the easy scenarios, our approach outperforms appearance-based feature selection in terms of localization errors. In the most challenging scenarios, it enables accurate visual-inertial navigation while appearance-based feature selection fails to track robot's motion during aggressive maneuvers.Comment: 20 pages, 7 figures, 2 table

    Movie Description

    Get PDF
    Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at ICCV 2015

    μ΄μ•ΌκΈ°ν˜• μ„€λͺ…문을 ν™œμš©ν•œ λŒ€κ·œλͺ¨ λΉ„λ””μ˜€ ν•™μŠ΅ 연ꡬ

    Get PDF
    ν•™μœ„λ…Όλ¬Έ (박사) -- μ„œμšΈλŒ€ν•™κ΅ λŒ€ν•™μ› : κ³΅κ³ΌλŒ€ν•™ 컴퓨터곡학뢀, 2021. 2. 김건희.Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a lot of interests in computer vision research, including image/video captioning, video retrieval and video question answering. It can be applied to high-level computer vision tasks and various future industries such as search engines, social marketing, automated driving, and robotics support through QA / dialog generation for the surrounding environment. However, despite these developments, video-language learning suffers from a higher degree of complexity. This thesis investigates methodologies for learning the relationship between videos and free-formed languages, including explanations, conversations, and question-and-answers, so that the machine can easily adapt to target downstream tasks. First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce the approaches for supervising human attention transfer for the video attention model, which shows the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce the end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using the representative visual concept word detected by the attention-based detector. As a follow-up study on previous methods, we introduce a JSFusion (Joint Sequence Fusion) method that enables efficient video search and QA by enabling many-to-many matching of attention model. Next, we introduce the CiSIN(Character in Story Identification Network), which uses Attention to increase the performance of character grounding and character re-identification in the movie. Finally, we introduce Transitional Adaptation, which promotes the caption generation models to generates coherent narratives for long videos. In summary, this thesis presents a novel approaches for automatic video description generation/retrieval and shows the benefits of extracting linguistic knowledge for object and motion in the video as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language tasks, it is expected to be applied to the latest models, bringing additional performance improvements. Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in the industry by integrating an unlimited amount of video, audio, and free-formed language data from the web.μ‹œκ°-μ–Έμ–΄ ν•™μŠ΅μ€ 이미지/λΉ„λ””μ˜€ μΊ‘μ…˜(Image/Video captioning), μ‹œκ° μ§ˆμ˜μ‘λ‹΅(Visual Question and Answering), λΉ„λ””μ˜€ 검색(Video Retrieval), μž₯λ©΄ 이해(scene understanding), 이벀트 인식(event detection) λ“± κ³ μ°¨μ›μ˜ 컴퓨터 λΉ„μ „ νƒœμŠ€ν¬(task)뿐만 μ•„λ‹ˆλΌ μ£Όλ³€ ν™˜κ²½μ— λŒ€ν•œ 질의 응닡 및 λŒ€ν™” 생성(Dialogue Generation)으둜 인터넷 검색 뿐만 μ•„λ‹ˆλΌ 졜근 ν™œλ°œν•œ μ†Œμ…œ λ§ˆμΌ€νŒ…(Social Marketing) 자율 μ£Όν–‰(Automated Driving), λ‘œλ³΄ν‹±μŠ€(Robotics)을 λ³΄μ‘°ν•˜λŠ” λ“± μ—¬λŸ¬ 미래 산업에 적용될 수 μžˆμ–΄ ν™œλ°œνžˆ μ—°κ΅¬λ˜κ³  μžˆλŠ” μ€‘μš”ν•œ 뢄야이닀. 컴퓨터 λΉ„μ Όκ³Ό μžμ—°μ–΄ μ²˜λ¦¬λŠ” μ΄λŸ¬ν•œ μ€‘μš”μ„±μ„ λ°”νƒ•μœΌλ‘œ 각자 κ³ μœ ν•œ μ˜μ—­μ—μ„œ λ°œμ „μ„ κ±°λ“­ν•΄ μ™”μœΌλ‚˜, 졜근 λ”₯λŸ¬λ‹μ˜ λ“±μž₯κ³Ό ν•¨κ»˜ λˆˆλΆ€μ‹œκ²Œ λ°œμ „ν•˜λ©΄μ„œ μ„œλ‘œλ₯Ό λ³΄μ™„ν•˜λ©° ν•™μŠ΅ κ²°κ³Όλ₯Ό ν–₯μƒμ‹œν‚€λŠ” λ“± 큰 μ‹œλ„ˆμ§€ 효과λ₯Ό λ°œνœ˜ν•˜κ²Œ λ˜μ—ˆλ‹€. ν•˜μ§€λ§Œ 이런 λ°œμ „μ—λ„ λΆˆκ΅¬ν•˜κ³ , λΉ„λ””μ˜€-μ–Έμ–΄κ°„ ν•™μŠ΅μ€ 문제의 λ³΅μž‘λ„κ°€ ν•œμΈ΅ λ†’μ•„ 어렀움을 κ²ͺ게 λ˜λŠ” κ²½μš°κ°€ λ§Žλ‹€. λ³Έ λ…Όλ¬Έμ—μ„œλŠ” λΉ„λ””μ˜€μ™€ 이에 λŒ€μ‘ν•˜λŠ” μ„€λͺ…, λŒ€ν™”, 질의 응닡 λ“± 더 λ‚˜μ•„κ°€ 자유 ν˜•νƒœμ˜ μ–Έμ–΄ (Free-formed language)κ°„μ˜ 관계λ₯Ό λ”μš± 효율적으둜 ν•™μŠ΅ν•˜κ³ , λͺ©ν‘œ μž„λ¬΄μ— 잘 λŒ€μ‘ν•  수 μžˆλ„λ‘ κ°œμ„ ν•˜λŠ” 것을 λͺ©ν‘œλ‘œ ν•œλ‹€. λ¨Όμ €, μ‹œκ°μ  λ³΅μž‘λ„κ°€ 이미지보닀 높은 λΉ„λ””μ˜€μ™€ κΈ΄ λ¬Έμž₯ μ‚¬μ΄μ˜ 관계λ₯Ό 효율적으둜 ν•™μŠ΅ν•˜κΈ° μœ„ν•œ μ—¬λŸ¬ 방법듀을 μ†Œκ°œν•œλ‹€. μΈκ°„μ˜ 주의 인식(Attention) λͺ¨λΈμ„ λΉ„λ””μ˜€-μ–Έμ–΄ λͺ¨λΈμ— 지도 ν•™μŠ΅ ν•˜λŠ” 방법을 μ†Œκ°œν•˜κ³ , μ΄μ–΄μ„œ λΉ„λ””μ˜€μ—μ„œ μš°μ„  κ²€μΆœλœ λŒ€ν‘œ μ‹œκ° 단어λ₯Ό 맀개둜 ν•˜μ—¬ 주의 인식(Attention) μ•Œκ³ λ¦¬μ¦˜μ˜ λ³΅μž‘λ„λ₯Ό λ”μš± μ€„μ΄λŠ” 의미 쀑심 주의 인식 (Semantic Attention) 방법, μ–΄ν…μ…˜ λͺ¨λΈμ˜ λ‹€λŒ€λ‹€ 맀칭을 기반으둜 효율적인 λΉ„λ””μ˜€ 검색 및 μ§ˆμ˜μ‘λ‹΅μ„ κ°€λŠ₯μΌ€ ν•˜λŠ” λΉ„λ””μ˜€-μ–Έμ–΄κ°„ μœ΅ν•© (Joint Sequence Fusion) 방법 λ“± λΉ„λ””μ˜€ 주의 인식을 효율적으둜 ν•™μŠ΅μ‹œν‚¬ 수 μžˆλŠ” 방법듀을 μ œμ‹œν•œλ‹€. λ‹€μŒμœΌλ‘œλŠ”, 주의 인식(Attention) λͺ¨λΈμ΄ 물체-단어 κ°„ 관계λ₯Ό λ„˜μ–΄ λΉ„λ””μ˜€ μƒμ—μ„œ 인물 검색 (Person Searching) 그리고 인물 재 식별 (Person Re-Identification)을 λ™μ‹œμ— μˆ˜ν–‰ν•˜λ©° μƒμŠΉμž‘μš©μ„ μΌμœΌν‚€λŠ” μŠ€ν† λ¦¬ 속 캐릭터 인식 신경망 (Character in Story Identification Network) 을 μ†Œκ°œν•˜λ©°, λ§ˆμ§€λ§‰μœΌλ‘œ 자기 지도 ν•™μŠ΅(Self-supervised Learning)을 톡해 주의 인식(Attention) 기반 μ–Έμ–΄ λͺ¨λΈμ΄ κΈ΄ λΉ„λ””μ˜€μ— λŒ€ν•œ μ„€λͺ…을 μ—°κ΄€μ„± 있게 잘 생성할 수 μžˆλ„λ‘ μœ λ„ν•˜λŠ” 방법을 μ†Œκ°œν•œλ‹€. μš”μ•½ν•˜μžλ©΄, 이 ν•™μœ„ λ…Όλ¬Έμ—μ„œ μ œμ•ˆν•œ μƒˆλ‘œμš΄ 방법둠듀은 λΉ„λ””μ˜€-μ–Έμ–΄ ν•™μŠ΅μ— ν•΄λ‹Ήν•˜λŠ” λΉ„λ””μ˜€ μΊ‘μ…˜(Video captioning), λΉ„λ””μ˜€ 검색(Video Retrieval), μ‹œκ° μ§ˆμ˜μ‘λ‹΅(Video Question and Answering)등을 ν•΄κ²°ν•  수 μžˆλŠ” 기술적 λ””λ”€λŒμ΄ 되며, λΉ„λ””μ˜€ μΊ‘μ…˜ ν•™μŠ΅μ„ 톡해 ν•™μŠ΅λœ 주의 인식 λͺ¨λ“ˆμ€ 검색 및 μ§ˆμ˜μ‘λ‹΅, 인물 검색 λ“± 각 λ„€νŠΈμ›Œν¬μ— μ΄μ‹λ˜λ©΄μ„œ μƒˆλ‘œμš΄ λ¬Έμ œλ“€μ— λŒ€ν•΄ λ™μ‹œμ— 졜고 μˆ˜μ€€(State-of-the-art)의 μ„±λŠ₯을 λ‹¬μ„±ν•˜μ˜€λ‹€. 이λ₯Ό 톡해 λΉ„λ””μ˜€-μ–Έμ–΄ ν•™μŠ΅μœΌλ‘œ 얻은 μ–Έμ–΄ μ§€μ‹μ˜ 이전은 μ‹œκ°-청각을 μ•„μš°λ₯΄λŠ” λΉ„λ””μ˜€ λ©€ν‹°λͺ¨λ‹¬ ν•™μŠ΅μ— 큰 도움이 λ˜λŠ” 것을 μ‹€ν—˜μ μœΌλ‘œ 보여쀀닀. ν–₯ν›„ μž‘μ—… λ°©ν–₯ (Future Work)μœΌλ‘œλŠ” μ•žμ„œ μ—°κ΅¬ν•œ λ‚΄μš©λ“€μ„ 기반으둜 μ›Ή 속에 μ‘΄μž¬ν•˜λŠ” λŒ€κ·œλͺ¨μ˜ μ–Έμ–΄, λΉ„λ””μ˜€, μ˜€λ””μ˜€ 데이터λ₯Ό 톡합해 ν•™μŠ΅μ— ν™œμš©ν•˜μ—¬ μ‚°μ—…κ³„μ˜ λ§Žμ€ λ‚œμ œλ₯Ό ν•΄κ²°ν•  수 μžˆλŠ” 비지도 ν•™μŠ΅ λͺ¨λΈμ„ λ§Œλ“€κ³ μž ν•œλ‹€.Chapter 1 Introduction 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 1.2 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . .8 Chapter 2 Related Work 2.1 Video Captioning . . . . . . . . . . . . . . . . . . . . . . . . . . .9 2.2 Video Retrieval with Natural Language . . . . . . . . . . . . . . 12 2.3 Video Question and Answering . . . . . . . . . . . . . . . . . . . 13 2.4 Cross-modal Representation Learning for Vision and LanguageTasks . . . . 15 Chapter 3 Human Attention Transfer for Video Captioning18 3.1 Introduction 3.2 Video Datasets for Caption and Gaze . . . . . . . . . . . . . . . 21 3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.1 Video Pre-processing and Description . . . . . . . . . . . 22 3.3.2The Recurrent Gaze Prediction (RGP) Model . . . . . . . 23 3.3.3Construction of Visual Feature Pools . . . . . . . . . . . . 24 3.3.4The Decoder for Caption Generation . . . . . . . . . . . . 26 3.3.5Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.4.1Evaluation of Gaze Prediction . . . . . . . . . . . . . . . . 29 3.4.2Evaluation of Video Captioning . . . . . . . . . . . . . . . 32 3.4.3Human Evaluation via AMT . . . . . . . . . . . . . . . . 35 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Chapter 4 Semantic Word Attention for Video QA and VideoCaptioning 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1.1Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.2Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2.1Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2.2An Attention Model for Concept Detection . . . . . . . . 42 4.2.3Video-to-Language Models . . . . . . . . . . . . . . . . . 45 4.2.4A Model for Description . . . . . . . . . . . . . . . . . . . 45 4.2.5A Model for Fill-in-the-Blank . . . . . . . . . . . . . . . . 48 4.2.6A Model for Multiple-Choice Test . . . . . . . . . . . . . 50 4.2.7A Model for Retrieval . . . . . . . . . . . . . . . . . . . . 51 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.1The LSMDC Dataset and Tasks . . . . . . . . . . . . . . 52 4.3.2Quantitative Results . . . . . . . . . . . . . . . . . . . . . 54 4.3.3Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 56 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Chapter 5 Joint Sequnece Fusion Attention for Multimodal Sequence Data 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3.1Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3.2The Joint Semantic Tensor . . . . . . . . . . . . . . . . . 65 5.3.3The Convolutional Hierarchical Decoder . . . . . . . . . . 66 5.3.4An Illustrative Example of How the JSFusion Model Works 68 5.3.5Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.3.6Implementation of Video-Language Models . . . . . . . . 69 5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.1LSMDC Dataset and Tasks . . . . . . . . . . . . . . . . . 71 5.4.2MSR-VTT-(RET/MC) Dataset and Tasks . . . . . . . . . 73 5.4.3Quantitative Results . . . . . . . . . . . . . . . . . . . . . 74 5.4.4Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 76 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Chapter 6 Character Re-Identification and Character Ground-ing for Movie Understanding 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.3.1Video Preprocessing . . . . . . . . . . . . . . . . . . . . . 84 6.3.2Visual Track Embedding . . . . . . . . . . . . . . . . . . . 85 6.3.3Textual Character Embedding . . . . . . . . . . . . . . . 86 6.3.4Character Grounding . . . . . . . . . . . . . . . . . . . . 87 6.3.5Re-Identification . . . . . . . . . . . . . . . . . . . . . . . 88 6.3.6Joint Training . . . . . . . . . . . . . . . . . . . . . . . . 90 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.4.1Experimental Setup . . . . . . . . . . . . . . . . . . . . . 92 6.4.2Quantitative Results . . . . . . . . . . . . . . . . . . . . . 93 6.4.3Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 95 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Chapter 7 Transitional Adaptation of Pretrained Models forVisual Storytelling 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 7.3.1The Visual Encoder . . . . . . . . . . . . . . . . . . . . . 104 7.3.2The Language Generator . . . . . . . . . . . . . . . . . . 104 7.3.3Adaptation training . . . . . . . . . . . . . . . . . . . . . 105 7.3.4The Sequential Coherence Loss . . . . . . . . . . . . . . . 105 7.3.5Training with the adaptation Loss . . . . . . . . . . . . . 107 7.3.6Fine-tuning and Inference . . . . . . . . . . . . . . . . . . 107 7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7.4.1Experimental Setup . . . . . . . . . . . . . . . . . . . . . 109 7.4.2Quantitative Results . . . . . . . . . . . . . . . . . . . . . 112 7.4.3Further Analyses . . . . . . . . . . . . . . . . . . . . . . . 112 7.4.4Human Evaluation Results . . . . . . . . . . . . . . . . . 115 7.4.5Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 116 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Chapter 8 Conclusion 8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 8.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Bibliography ... 123 μš”μ•½ ... 148 Acknowledgements ... 150Docto
    • …
    corecore