199,470 research outputs found

    Learning Visual Question Answering by Bootstrapping Hard Attention

    Attention mechanisms in biological perception are thought to select subsets of perceptual information for more sophisticated processing that would be prohibitive to perform on all sensory inputs. In computer vision, however, there has been relatively little exploration of hard attention, where some information is selectively ignored, in spite of the success of soft attention, where information is re-weighted and aggregated but never filtered out. Here, we introduce a new approach for hard attention and find that it achieves very competitive performance on recently released visual question answering datasets, equalling and in some cases surpassing similar soft attention architectures while entirely ignoring some features. Even though the hard attention mechanism is thought to be non-differentiable, we found that the feature magnitudes correlate with semantic relevance and provide a useful signal for our mechanism's attentional selection criterion. Because hard attention selects important features of the input information, it can also be more efficient than analogous soft attention mechanisms. This is especially important for recent approaches that use non-local pairwise operations, whereby computational and memory costs are quadratic in the size of the set of features. Comment: ECCV 201
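The contrast the abstract draws between hard and soft attention can be illustrated with a minimal sketch. It is not the paper's implementation: the L2 norm as the magnitude measure, the fixed budget `k`, the function names, and the toy feature values are all assumptions made for illustration.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def hard_attention(features, k):
    """Keep the k feature vectors with the largest L2 norm; discard the rest."""
    ranked = sorted(range(len(features)),
                    key=lambda i: l2_norm(features[i]), reverse=True)
    kept = sorted(ranked[:k])  # restore the original ordering of positions
    return kept, [features[i] for i in kept]

def soft_attention(features, scores):
    """Softmax-weight all feature vectors and sum them; nothing is filtered out."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(len(features[0]))]
    return weights, pooled

# Toy 2-D feature vectors for four spatial positions (made-up values).
features = [[0.1, 0.2], [3.0, 4.0], [0.0, 0.5], [1.0, 1.0]]

kept, selected = hard_attention(features, k=2)
weights, pooled = soft_attention(features, [l2_norm(f) for f in features])

# A non-local pairwise operation touches every ordered pair of features.
pairs_all = len(features) ** 2    # 16 pairs over the full set
pairs_kept = len(selected) ** 2   # 4 pairs after hard selection
```

Halving the number of features cuts the pairwise count by a factor of four, which is the quadratic saving the abstract refers to.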

    Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets

    Visual question answering (Visual QA) has attracted a lot of attention lately, seen essentially as a form of (visual) Turing test that artificial intelligence should strive to achieve. In this paper, we study a crucial component of this task: how can we design good datasets for it? We focus on the design of multiple-choice datasets where the learner has to select the right answer from a set of candidates that includes the target (i.e., the correct one) and the decoys (i.e., the incorrect ones). Through careful analysis of the results attained by state-of-the-art learning models and human annotators on existing datasets, we show that the design of the decoy answers has a significant impact on how and what the learning models learn from the datasets. In particular, the resulting learner can ignore the visual information, the question, or both while still doing well on the task. Inspired by this, we propose automatic procedures to remedy such design deficiencies. We apply the procedures to re-construct decoy answers for two popular Visual QA datasets as well as to create a new Visual QA dataset from the Visual Genome project, resulting in the largest dataset for this task. Extensive empirical studies show that the design deficiencies have been alleviated in the remedied datasets and that performance on them is likely a more faithful indicator of the differences among learning models. The datasets are released and publicly available via http://www.teds.usc.edu/website_vqa/. Comment: Accepted for Oral Presentation at NAACL-HLT 201
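One way to see how a learner might do well while ignoring both image and question is an answer-only baseline, sketched below. This is an illustrative diagnostic, not the procedure from the paper: the function name, the record layout, and the toy examples are hypothetical.

```python
from collections import Counter

def answer_only_accuracy(train, test):
    """Accuracy of a baseline that sees only the candidate answers."""
    # Count how often each answer string is the correct one in training.
    prior = Counter(item["target"] for item in train)
    correct = 0
    for item in test:
        # Pick the candidate seen most often as a correct answer.
        guess = max(item["candidates"], key=lambda a: prior[a])
        correct += guess == item["target"]
    return correct / len(test)

# Made-up toy records: three candidates each, one of them the target.
train = [
    {"candidates": ["yes", "pizza", "blue"], "target": "yes"},
    {"candidates": ["2", "yes", "dog"], "target": "yes"},
    {"candidates": ["white", "no", "7"], "target": "white"},
]
test = [
    {"candidates": ["yes", "sofa", "11"], "target": "yes"},
    {"candidates": ["white", "kite", "yes"], "target": "white"},
]

accuracy = answer_only_accuracy(train, test)
chance = 1 / 3  # uniform guessing among three candidates
```

If such a baseline scores well above chance on a real dataset, the decoys are probably distinguishable from targets without looking at the image or the question.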

    A Study on Large-Scale Video Learning Using Narrative Descriptions

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2021. 2. Gunhee Kim (김건희). Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this context, various video-language tasks have drawn considerable interest in computer vision research, including image/video captioning, video retrieval, and video question answering. These tasks apply to high-level computer vision problems and, through QA and dialog generation about the surrounding environment, to future industries such as search engines, social marketing, automated driving, and robotics support. Despite these developments, video-language learning suffers from a higher degree of complexity. This thesis investigates methodologies for learning the relationship between videos and free-form language, including explanations, conversations, and questions and answers, so that the machine can easily adapt to target downstream tasks. First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce an approach that supervises a video attention model with human attention, which shows that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce the end-to-end semantic attention method, which further reduces the complexity of the visual attention algorithm by using representative visual concept words detected by an attention-based detector. As a follow-up to these methods, we introduce JSFusion (Joint Sequence Fusion), which enables efficient video search and QA through many-to-many matching in the attention model. Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to improve character grounding and character re-identification in movies.
    Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos. In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in video as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applicable to the latest models, bringing additional performance improvements. Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating an unlimited amount of video, audio, and free-form language data from the web.

    Vision-language learning is an important and actively studied field. It applies to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding, and event detection. Through question answering and dialogue generation about the surrounding environment, it also applies to many future industries, supporting internet search as well as social marketing, automated driving, and robotics. Computer vision and natural language processing have each advanced within their own domains, but with the recent advent of deep learning they have developed remarkably and now complement each other, producing large synergies such as improved learning results. Despite this progress, video-language learning often runs into difficulty because the problem is considerably more complex.
    This thesis aims to learn more efficiently the relationship between videos and their corresponding descriptions, dialogues, question-answer pairs and, more broadly, free-form language, and to improve adaptation to target tasks. First, we introduce several methods for efficiently learning the relationship between long sentences and videos, which are visually more complex than images. We present a method that supervises a video-language model with human attention; a semantic attention method that further reduces the complexity of the attention algorithm by using representative visual words detected first from the video; and a Joint Sequence Fusion method that enables efficient video retrieval and question answering through many-to-many matching in the attention model. Next, we introduce the Character in Story Identification Network, in which the attention model goes beyond object-word relations to perform person search and person re-identification in video jointly, so that the two tasks reinforce each other. Finally, we introduce a method that uses self-supervised learning to guide an attention-based language model to generate coherent descriptions of long videos. In summary, the methods proposed in this thesis serve as technical stepping stones for video-language tasks such as video captioning, video retrieval, and video question answering. The attention module learned through video captioning, when transplanted into networks for retrieval, question answering, and person search, achieved state-of-the-art performance on these new problems simultaneously. This shows experimentally that transferring linguistic knowledge obtained through video-language learning greatly helps multimodal video learning that spans vision and audio.
    As future work, building on the research above, we intend to create an unsupervised learning model that can solve many hard problems in industry by integrating the large-scale language, video, and audio data available on the web and using it for training.

    Chapter 1 Introduction
        1.1 Contributions
        1.2 Outline of the thesis
    Chapter 2 Related Work
        2.1 Video Captioning
        2.2 Video Retrieval with Natural Language
        2.3 Video Question and Answering
        2.4 Cross-modal Representation Learning for Vision and Language Tasks
    Chapter 3 Human Attention Transfer for Video Captioning
        3.1 Introduction
        3.2 Video Datasets for Caption and Gaze
        3.3 Approach
            3.3.1 Video Pre-processing and Description
            3.3.2 The Recurrent Gaze Prediction (RGP) Model
            3.3.3 Construction of Visual Feature Pools
            3.3.4 The Decoder for Caption Generation
            3.3.5 Training
        3.4 Experiments
            3.4.1 Evaluation of Gaze Prediction
            3.4.2 Evaluation of Video Captioning
            3.4.3 Human Evaluation via AMT
        3.5 Conclusion
    Chapter 4 Semantic Word Attention for Video QA and Video Captioning
        4.1 Introduction
            4.1.1 Related Work
            4.1.2 Contributions
        4.2 Approach
            4.2.1 Preprocessing
            4.2.2 An Attention Model for Concept Detection
            4.2.3 Video-to-Language Models
            4.2.4 A Model for Description
            4.2.5 A Model for Fill-in-the-Blank
            4.2.6 A Model for Multiple-Choice Test
            4.2.7 A Model for Retrieval
        4.3 Experiments
            4.3.1 The LSMDC Dataset and Tasks
            4.3.2 Quantitative Results
            4.3.3 Qualitative Results
        4.4 Conclusion
    Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
        5.1 Introduction
        5.2 Related Work
        5.3 Approach
            5.3.1 Preprocessing
            5.3.2 The Joint Semantic Tensor
            5.3.3 The Convolutional Hierarchical Decoder
            5.3.4 An Illustrative Example of How the JSFusion Model Works
            5.3.5 Training
            5.3.6 Implementation of Video-Language Models
        5.4 Experiments
            5.4.1 LSMDC Dataset and Tasks
            5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks
            5.4.3 Quantitative Results
            5.4.4 Qualitative Results
        5.5 Conclusion
    Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
        6.1 Introduction
        6.2 Related Work
        6.3 Approach
            6.3.1 Video Preprocessing
            6.3.2 Visual Track Embedding
            6.3.3 Textual Character Embedding
            6.3.4 Character Grounding
            6.3.5 Re-Identification
            6.3.6 Joint Training
        6.4 Experiments
            6.4.1 Experimental Setup
            6.4.2 Quantitative Results
            6.4.3 Qualitative Results
        6.5 Conclusion
    Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
        7.1 Introduction
        7.2 Related Work
        7.3 Approach
            7.3.1 The Visual Encoder
            7.3.2 The Language Generator
            7.3.3 Adaptation training
            7.3.4 The Sequential Coherence Loss
            7.3.5 Training with the adaptation Loss
            7.3.6 Fine-tuning and Inference
        7.4 Experiments
            7.4.1 Experimental Setup
            7.4.2 Quantitative Results
            7.4.3 Further Analyses
            7.4.4 Human Evaluation Results
            7.4.5 Qualitative Results
        7.5 Conclusion
    Chapter 8 Conclusion
        8.1 Summary
        8.2 Future Works
    Bibliography
    Abstract in Korean
    Acknowledgements
    Docto

    Visual Question Answering: A Survey of Methods and Datasets

    Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models. Comment: 25 page
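The joint-embedding idea mentioned in the abstract, mapping image and question features into a common space before predicting an answer, can be sketched in a few lines. This is only an illustration under stated assumptions: the image and question vectors stand in for CNN and RNN outputs, the weight matrices are hand-picked toy values rather than learned parameters, and element-wise multiplication is just one possible fusion choice, not one the survey prescribes.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def joint_embedding_scores(image_feat, question_feat, W_img, W_q, W_ans):
    img = matvec(W_img, image_feat)            # image -> common space
    qst = matvec(W_q, question_feat)           # question -> common space
    fused = [a * b for a, b in zip(img, qst)]  # element-wise fusion (assumed)
    return matvec(W_ans, fused)                # one score per candidate answer

# Stand-ins for a CNN image feature and an RNN question encoding.
image_feat = [1.0, 2.0]
question_feat = [0.5, -1.0, 2.0]

# Hypothetical projection and classifier weights (3-D common space, 2 answers).
W_img = [[1, 0], [0, 1], [1, 1]]
W_q = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W_ans = [[1, 1, 1], [1, 0, 0]]

scores = joint_embedding_scores(image_feat, question_feat, W_img, W_q, W_ans)
best_answer = scores.index(max(scores))
```

In a trained system the three matrices would be learned jointly with the encoders, and the scores would typically pass through a softmax over the answer vocabulary.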