1,203 research outputs found
Video Question Answering on Screencast Tutorials
This paper presents a new video question answering task on screencast
tutorials. We introduce a dataset of question, answer, and context triples
from tutorial videos for a software application. Unlike other video question
answering work, all answers in our dataset are grounded to a domain
knowledge base. A one-shot recognition algorithm is designed to extract
visual cues, which helps enhance video question answering performance.
We also propose several baseline neural network architectures based on
different aspects of the video contexts in the dataset. The experimental results
demonstrate that our proposed models significantly improve question
answering performance by incorporating multi-modal contexts and domain
knowledge.
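To make the grounding idea concrete, the following is an illustrative sketch (not the authors' code) of how answers could be selected from a domain knowledge base by fusing question, transcript, and visual-cue features; the module names and dimensions are hypothetical placeholders.

import torch
import torch.nn as nn

class KBGroundedAnswerer(nn.Module):
    """Scores every knowledge-base entry as a candidate answer (hypothetical dims)."""
    def __init__(self, text_dim=768, visual_dim=512, kb_dim=300, hidden=512):
        super().__init__()
        self.fuse = nn.Linear(text_dim * 2 + visual_dim, hidden)   # question + transcript + visual cues
        self.kb_proj = nn.Linear(kb_dim, hidden)

    def forward(self, question, transcript, visual_cues, kb_entries):
        # question, transcript: (B, text_dim); visual_cues: (B, visual_dim)
        # kb_entries: (N, kb_dim) embeddings of all entries in the domain knowledge base
        context = torch.tanh(self.fuse(torch.cat([question, transcript, visual_cues], dim=-1)))
        candidates = self.kb_proj(kb_entries)        # (N, hidden)
        scores = context @ candidates.t()            # (B, N): similarity to every KB entry
        return scores.argmax(dim=-1)                 # index of the predicted grounded answer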
MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder Language Model for Video-grounded Dialogue Generation
Generating dialogue grounded in videos requires a high level of understanding
and reasoning about the visual scenes in the videos. However, existing large
vision-language models are not effective for this task due to their reliance on latent
features and decoder-only structure, especially with respect to spatio-temporal
relationship reasoning. In this paper, we propose a novel approach named MSG-BART, which
enhances the integration of video information by incorporating a
multi-granularity spatio-temporal scene graph into an encoder-decoder
pre-trained language model. Specifically, we integrate the global and local
scene graphs into the encoder and decoder, respectively, to improve both overall
perception and target reasoning capability. To further improve information
selection capability, we propose a multi-pointer network to facilitate
selection between text and video. Extensive experiments are conducted on three
video-grounded dialogue benchmarks, which show the significant superiority of
the proposed MSG-BART over a range of state-of-the-art approaches.
Comment: 5 pages, 3 figures
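As a rough illustration of what a multi-pointer selection mechanism can look like (a sketch under stated assumptions, not MSG-BART's implementation), the decoder can mix a vocabulary distribution with copy distributions over dialogue tokens and scene-graph nodes; gate_layer is assumed to be a small linear layer producing three mixing weights.

import torch
import torch.nn.functional as F

def multi_pointer_mixture(dec_state, vocab_logits, text_attn, video_attn,
                          text_token_ids, video_token_ids, gate_layer):
    # dec_state: (B, H) decoder hidden state; vocab_logits: (B, V)
    # text_attn: (B, T) attention over dialogue tokens; video_attn: (B, N) over graph nodes
    # text_token_ids / video_token_ids: (B, T) / (B, N) vocabulary ids of those tokens / nodes
    gates = F.softmax(gate_layer(dec_state), dim=-1)       # (B, 3): generate / copy-text / copy-video
    p_vocab = F.softmax(vocab_logits, dim=-1)
    p_text = torch.zeros_like(p_vocab).scatter_add_(1, text_token_ids, text_attn)
    p_video = torch.zeros_like(p_vocab).scatter_add_(1, video_token_ids, video_attn)
    # Final next-token distribution is a gated mixture of the three sources.
    return gates[:, 0:1] * p_vocab + gates[:, 1:2] * p_text + gates[:, 2:3] * p_video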
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
We present Unified-IO 2, the first autoregressive multimodal model capable of
understanding and generating image, text, audio, and action. To unify the
different modalities, we tokenize inputs and outputs -- images, text,
audio, action, bounding boxes, etc. -- into a shared semantic space and then
process them with a single encoder-decoder transformer model. Since training
with such diverse modalities is challenging, we propose various architectural
improvements to stabilize model training. We train our model from scratch on a
large multimodal pre-training corpus from diverse sources with a multimodal
mixture-of-denoisers objective. To learn an expansive set of skills, such as
following multimodal instructions, we construct and finetune on an ensemble of
120 datasets with prompts and augmentations. With a single unified model,
Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and
strong results on more than 35 benchmarks, including image generation and
understanding, natural language understanding, video and audio understanding,
and robotic manipulation. We release all our models to the research community.
Comment: 38 pages, 20 figures
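A conceptual sketch of the shared-token idea (assumed interfaces, not the released Unified-IO 2 code): each modality is discretized into token ids that live in disjoint ranges of one shared vocabulary, so a single encoder-decoder transformer can consume and emit any mix of modalities. The offsets and ids below are hypothetical.

from dataclasses import dataclass

@dataclass
class Segment:
    modality: str           # "text", "image", "audio", "action", ...
    token_ids: list         # discrete token ids produced by that modality's tokenizer

def build_shared_sequence(segments, modality_offsets):
    """Concatenate per-modality token ids into one sequence for a single encoder-decoder."""
    sequence = []
    for seg in segments:
        offset = modality_offsets[seg.modality]     # each modality owns a disjoint vocab range
        sequence.extend(offset + t for t in seg.token_ids)
    return sequence

# Hypothetical usage: a text prompt and image patch tokens become one input sequence;
# decoder outputs are routed back to a modality by checking which range an id falls in.
offsets = {"text": 0, "image": 50_000, "audio": 80_000, "action": 110_000}
inputs = [Segment("text", [12, 873, 4]), Segment("image", [5, 99, 2048])]
encoder_input = build_shared_sequence(inputs, offsets)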
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection, or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes the transfer capabilities of pre-trained models in a zero-shot, few-shot, or limited-finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, 23s average length, designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), along with a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a large gap in performance (91.4% vs 45.8%), suggesting significant room for improvement in multimodal video understanding. The dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_tes
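For the multiple-choice video question-answer labels, zero-shot evaluation amounts to scoring each option with a frozen model and picking the best one. The sketch below is a hedged illustration of that protocol, where score_option is a hypothetical wrapper around whatever pretrained model is being probed, not a benchmark API.

def evaluate_multiple_choice(model, examples, score_option):
    """Zero-shot multiple-choice accuracy: pick the highest-scoring option per question."""
    correct = 0
    for ex in examples:
        # ex: {"video": ..., "question": str, "options": [str, ...], "answer_idx": int}
        scores = [score_option(model, ex["video"], ex["question"], opt)
                  for opt in ex["options"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == ex["answer_idx"])
    return correct / len(examples)   # accuracy, comparable to the human baseline above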
A Study on Large-Scale Video Learning Using Narrative Descriptions
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: Gunhee Kim.
Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a great deal of interest in computer vision research, including image/video captioning, video retrieval, and video question answering.
Video-language learning can be applied to high-level computer vision tasks and to various future industries such as search engines, social marketing, automated driving, and robotics, for example through question answering and dialogue generation about the surrounding environment.
However, despite these developments, video-language learning suffers from a much higher degree of complexity.
This thesis investigates methodologies for learning the relationship between videos and free-form language, including explanations, conversations, and question-and-answers, so that a machine can easily adapt to target downstream tasks.
First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer in a video attention model, which show that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce an end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using representative visual concept words detected by an attention-based detector. As a follow-up to these methods, we introduce JSFusion (Joint Sequence Fusion), which enables efficient video search and QA through many-to-many matching in the attention model.
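A minimal sketch of the many-to-many matching idea behind JSFusion (an illustrative simplification, not the thesis code): compute a pairwise frame-word similarity tensor and pool it into a single video-sentence relevance score.

import torch
import torch.nn.functional as F

def video_sentence_score(frame_feats, word_feats):
    # frame_feats: (F, D) per-frame embeddings; word_feats: (W, D) per-word embeddings
    frame_feats = F.normalize(frame_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    pairwise = frame_feats @ word_feats.t()          # (F, W) frame-word similarity tensor
    best_frame_per_word = pairwise.max(dim=0).values # each word attends to its best frame
    return best_frame_per_word.mean()                # pooled video-sentence relevance score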
Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to improve the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos.
In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in videos, as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applied to the latest models and bring additional performance improvements.
Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating an unlimited amount of video, audio, and free-form language data from the web.
Abstract (in Korean, translated): Vision-language learning is an important and actively studied field because it can be applied not only to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding, and event detection, but also, through question answering and dialogue generation about the surrounding environment, to internet search and to many future industries such as social marketing, automated driving, and robotics assistance.
Building on this importance, computer vision and natural language processing have each advanced within their own domains, and with the recent advent of deep learning they have progressed remarkably, complementing each other and improving learning results with strong synergy.
Despite these developments, however, video-language learning often remains difficult because the complexity of the problem is considerably higher.
This dissertation aims to learn the relationship between videos and the corresponding free-form language, such as descriptions, dialogues, and question answering, more efficiently, and to improve how well models adapt to target tasks.
First, we introduce several methods for efficiently learning the relationship between long sentences and videos, whose visual complexity exceeds that of images: a method that supervises the video-language attention model with human attention, a semantic attention method that further reduces the complexity of the attention algorithm by using representative visual words detected from the video first, and a Joint Sequence Fusion method that enables efficient video retrieval and question answering based on many-to-many matching in the attention model.
Next, we introduce the Character in Story Identification Network, in which the attention model goes beyond object-word relations to perform person search (character grounding) and person re-identification in videos simultaneously so that the two tasks reinforce each other. Finally, we introduce a method that uses self-supervised learning to guide an attention-based language model to generate coherent descriptions for long videos.
In summary, the novel methods proposed in this dissertation serve as technical stepping stones for video-language tasks such as video captioning, video retrieval, and video question answering; the attention modules learned through video caption training were transplanted into the retrieval, question answering, and person search networks, achieving state-of-the-art performance on these new problems. This shows experimentally that transferring linguistic knowledge obtained from video-language learning greatly benefits multimodal video learning that spans vision and audition. As future work, building on these studies, we plan to build an unsupervised learning model that integrates large-scale language, video, and audio data from the web to solve many open problems in industry.
Chapter 1 Introduction
1.1 Contributions
1.2 Outline of the Thesis
Chapter 2 Related Work
2.1 Video Captioning
2.2 Video Retrieval with Natural Language
2.3 Video Question and Answering
2.4 Cross-modal Representation Learning for Vision and Language Tasks
Chapter 3 Human Attention Transfer for Video Captioning
3.1 Introduction
3.2 Video Datasets for Caption and Gaze
3.3 Approach
3.3.1 Video Pre-processing and Description
3.3.2 The Recurrent Gaze Prediction (RGP) Model
3.3.3 Construction of Visual Feature Pools
3.3.4 The Decoder for Caption Generation
3.3.5 Training
3.4 Experiments
3.4.1 Evaluation of Gaze Prediction
3.4.2 Evaluation of Video Captioning
3.4.3 Human Evaluation via AMT
3.5 Conclusion
Chapter 4 Semantic Word Attention for Video QA and Video Captioning
4.1 Introduction
4.1.1 Related Work
4.1.2 Contributions
4.2 Approach
4.2.1 Preprocessing
4.2.2 An Attention Model for Concept Detection
4.2.3 Video-to-Language Models
4.2.4 A Model for Description
4.2.5 A Model for Fill-in-the-Blank
4.2.6 A Model for Multiple-Choice Test
4.2.7 A Model for Retrieval
4.3 Experiments
4.3.1 The LSMDC Dataset and Tasks
4.3.2 Quantitative Results
4.3.3 Qualitative Results
4.4 Conclusion
Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
5.1 Introduction
5.2 Related Work
5.3 Approach
5.3.1 Preprocessing
5.3.2 The Joint Semantic Tensor
5.3.3 The Convolutional Hierarchical Decoder
5.3.4 An Illustrative Example of How the JSFusion Model Works
5.3.5 Training
5.3.6 Implementation of Video-Language Models
5.4 Experiments
5.4.1 LSMDC Dataset and Tasks
5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks
5.4.3 Quantitative Results
5.4.4 Qualitative Results
5.5 Conclusion
Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
6.1 Introduction
6.2 Related Work
6.3 Approach
6.3.1 Video Preprocessing
6.3.2 Visual Track Embedding
6.3.3 Textual Character Embedding
6.3.4 Character Grounding
6.3.5 Re-Identification
6.3.6 Joint Training
6.4 Experiments
6.4.1 Experimental Setup
6.4.2 Quantitative Results
6.4.3 Qualitative Results
6.5 Conclusion
Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
7.1 Introduction
7.2 Related Work
7.3 Approach
7.3.1 The Visual Encoder
7.3.2 The Language Generator
7.3.3 Adaptation Training
7.3.4 The Sequential Coherence Loss
7.3.5 Training with the Adaptation Loss
7.3.6 Fine-tuning and Inference
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Quantitative Results
7.4.3 Further Analyses
7.4.4 Human Evaluation Results
7.4.5 Qualitative Results
7.5 Conclusion
Chapter 8 Conclusion
8.1 Summary
8.2 Future Work
Bibliography
Abstract (in Korean)
Acknowledgements
A Survey of Natural Language Generation
This paper offers a comprehensive review of the research on Natural Language
Generation (NLG) over the past two decades, especially in relation to
data-to-text generation and text-to-text generation deep learning methods, as
well as new applications of NLG technology. This survey aims to (a) give the
latest synthesis of deep learning research on the NLG core tasks, as well as
the architectures adopted in the field; (b) detail meticulously and
comprehensively various NLG tasks and datasets, and draw attention to the
challenges in NLG evaluation, focusing on different evaluation methods and
their relationships; and (c) highlight some areas of future emphasis and relatively
recent research issues that arise from the increasing synergy between NLG and
other artificial intelligence areas, such as computer vision, text, and
computational creativity.
Comment: Accepted by ACM Computing Surveys (CSUR) 202
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
Visual grounding (VG) aims to establish fine-grained alignment between vision
and language. Ideally, it can be a testbed for vision-and-language models to
evaluate their understanding of the images and texts and their reasoning
abilities over their joint space. However, most existing VG datasets are
constructed using simple description texts, which do not require sufficient
reasoning over the images and texts. This has been demonstrated in a recent
study~\cite{luo2022goes}, where a simple LSTM-based text encoder without
pretraining can achieve state-of-the-art performance on mainstream VG datasets.
Therefore, in this paper, we propose a novel benchmark of \underline{S}cene
\underline{K}nowledge-guided \underline{V}isual \underline{G}rounding (SK-VG),
where the image content and referring expressions are not sufficient to ground
the target objects, forcing the models to reason over the long-form scene
knowledge. To perform this task, we propose two approaches to accept the
triple-type input: the first embeds the knowledge into the image features
before the image-query interaction, while the second leverages linguistic
structure to assist in computing the image-text matching. We conduct extensive
experiments to analyze these methods and show that the proposed approaches
achieve promising results but still leave room for improvement in both
performance and interpretability. The dataset and code are available at
\url{https://github.com/zhjohnchan/SK-VG}.
Comment: Computer Vision and Natural Language Processing. 21 pages, 14 figures. CVPR-202
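As a hedged sketch of the first strategy described above (injecting scene knowledge into the image features before the image-query interaction), under assumed shapes and layers rather than the paper's actual architecture:

import torch
import torch.nn as nn

class KnowledgeInjectedGrounder(nn.Module):
    """Injects scene-knowledge features into image features before image-query interaction."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.know_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.box_head = nn.Linear(dim, 4)            # a normalized box per query token

    def forward(self, img_feats, knowledge_feats, query_feats):
        # img_feats: (B, P, D) image patches; knowledge_feats: (B, K, D) encoded scene knowledge
        # query_feats: (B, Q, D) encoded referring expression
        img_with_know, _ = self.know_to_img(img_feats, knowledge_feats, knowledge_feats)
        grounded, _ = self.query_to_img(query_feats, img_with_know, img_with_know)
        return self.box_head(grounded).sigmoid()     # (B, Q, 4) predicted boxes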
- β¦