1,203 research outputs found
Video Question Answering on Screencast Tutorials
This paper presents a new video question answering task on screencast
tutorials. We introduce a dataset of question, answer, and context triples
from tutorial videos for a software application. Unlike other video question
answering work, all answers in our dataset are grounded to a domain
knowledge base. A one-shot recognition algorithm is designed to extract
visual cues, which helps enhance video question answering performance.
We also propose several baseline neural network architectures based on
different aspects of the video contexts in the dataset. The experimental results
demonstrate that our proposed models significantly improve question
answering performance by incorporating multi-modal contexts and domain
knowledge.
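To make the grounding idea concrete, the following is an illustrative sketch (not the authors' code) of how answers could be selected from a domain knowledge base by fusing question, transcript, and visual-cue features; the module names and dimensions are hypothetical placeholders.

import torch
import torch.nn as nn

class KBGroundedAnswerer(nn.Module):
    """Scores every knowledge-base entry as a candidate answer (hypothetical dims)."""
    def __init__(self, text_dim=768, visual_dim=512, kb_dim=300, hidden=512):
        super().__init__()
        self.fuse = nn.Linear(text_dim * 2 + visual_dim, hidden)   # question + transcript + visual cues
        self.kb_proj = nn.Linear(kb_dim, hidden)

    def forward(self, question, transcript, visual_cues, kb_entries):
        # question, transcript: (B, text_dim); visual_cues: (B, visual_dim)
        # kb_entries: (N, kb_dim) embeddings of all entries in the domain knowledge base
        context = torch.tanh(self.fuse(torch.cat([question, transcript, visual_cues], dim=-1)))
        candidates = self.kb_proj(kb_entries)        # (N, hidden)
        scores = context @ candidates.t()            # (B, N): similarity to every KB entry
        return scores.argmax(dim=-1)                 # index of the predicted grounded answer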
MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder Language Model for Video-grounded Dialogue Generation
Generating dialogue grounded in videos requires a high level of understanding
and reasoning about the visual scenes in the videos. However, existing large
vision-language models are not effective for this task due to their reliance on latent
features and decoder-only structure, especially with respect to spatio-temporal
relationship reasoning. In this paper, we propose a novel approach named MSG-BART, which
enhances the integration of video information by incorporating a
multi-granularity spatio-temporal scene graph into an encoder-decoder
pre-trained language model. Specifically, we integrate the global and local
scene graphs into the encoder and decoder, respectively, to improve both overall
perception and target reasoning capability. To further improve information
selection capability, we propose a multi-pointer network to facilitate
selection between text and video. Extensive experiments are conducted on three
video-grounded dialogue benchmarks, which show the significant superiority of
the proposed MSG-BART over a range of state-of-the-art approaches.
Comment: 5 pages, 3 figures
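As a rough illustration of what a multi-pointer selection mechanism can look like (a sketch under stated assumptions, not MSG-BART's implementation), the decoder can mix a vocabulary distribution with copy distributions over dialogue tokens and scene-graph nodes; gate_layer is assumed to be a small linear layer producing three mixing weights.

import torch
import torch.nn.functional as F

def multi_pointer_mixture(dec_state, vocab_logits, text_attn, video_attn,
                          text_token_ids, video_token_ids, gate_layer):
    # dec_state: (B, H) decoder hidden state; vocab_logits: (B, V)
    # text_attn: (B, T) attention over dialogue tokens; video_attn: (B, N) over graph nodes
    # text_token_ids / video_token_ids: (B, T) / (B, N) vocabulary ids of those tokens / nodes
    gates = F.softmax(gate_layer(dec_state), dim=-1)       # (B, 3): generate / copy-text / copy-video
    p_vocab = F.softmax(vocab_logits, dim=-1)
    p_text = torch.zeros_like(p_vocab).scatter_add_(1, text_token_ids, text_attn)
    p_video = torch.zeros_like(p_vocab).scatter_add_(1, video_token_ids, video_attn)
    # Final next-token distribution is a gated mixture of the three sources.
    return gates[:, 0:1] * p_vocab + gates[:, 1:2] * p_text + gates[:, 2:3] * p_video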
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
We present Unified-IO 2, the first autoregressive multimodal model capable of
understanding and generating image, text, audio, and action. To unify the
different modalities, we tokenize inputs and outputs -- images, text,
audio, action, bounding boxes, etc. -- into a shared semantic space and then
process them with a single encoder-decoder transformer model. Since training
with such diverse modalities is challenging, we propose various architectural
improvements to stabilize model training. We train our model from scratch on a
large multimodal pre-training corpus from diverse sources with a multimodal
mixture-of-denoisers objective. To learn an expansive set of skills, such as
following multimodal instructions, we construct and finetune on an ensemble of
120 datasets with prompts and augmentations. With a single unified model,
Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and
strong results on more than 35 benchmarks, including image generation and
understanding, natural language understanding, video and audio understanding,
and robotic manipulation. We release all our models to the research community.
Comment: 38 pages, 20 figures
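A conceptual sketch of the shared-token idea (assumed interfaces, not the released Unified-IO 2 code): each modality is discretized into token ids that live in disjoint ranges of one shared vocabulary, so a single encoder-decoder transformer can consume and emit any mix of modalities. The offsets and ids below are hypothetical.

from dataclasses import dataclass

@dataclass
class Segment:
    modality: str           # "text", "image", "audio", "action", ...
    token_ids: list         # discrete token ids produced by that modality's tokenizer

def build_shared_sequence(segments, modality_offsets):
    """Concatenate per-modality token ids into one sequence for a single encoder-decoder."""
    sequence = []
    for seg in segments:
        offset = modality_offsets[seg.modality]     # each modality owns a disjoint vocab range
        sequence.extend(offset + t for t in seg.token_ids)
    return sequence

# Hypothetical usage: a text prompt and image patch tokens become one input sequence;
# decoder outputs are routed back to a modality by checking which range an id falls in.
offsets = {"text": 0, "image": 50_000, "audio": 80_000, "action": 110_000}
inputs = [Segment("text", [12, 873, 4]), Segment("image", [5, 99, 2048])]
encoder_input = build_shared_sequence(inputs, offsets)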
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection, or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes the transfer capabilities of pre-trained models in a zero-shot, few-shot, or limited-finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, 23s average length, designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), along with a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a large gap in performance (91.4% vs 45.8%), suggesting significant room for improvement in multimodal video understanding. The dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_tes
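For the multiple-choice video question-answer labels, zero-shot evaluation amounts to scoring each option with a frozen model and picking the best one. The sketch below is a hedged illustration of that protocol, where score_option is a hypothetical wrapper around whatever pretrained model is being probed, not a benchmark API.

def evaluate_multiple_choice(model, examples, score_option):
    """Zero-shot multiple-choice accuracy: pick the highest-scoring option per question."""
    correct = 0
    for ex in examples:
        # ex: {"video": ..., "question": str, "options": [str, ...], "answer_idx": int}
        scores = [score_option(model, ex["video"], ex["question"], opt)
                  for opt in ex["options"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == ex["answer_idx"])
    return correct / len(examples)   # accuracy, comparable to the human baseline above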
A Study on Large-Scale Video Learning Using Narrative Descriptions
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: Gunhee Kim.
Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a great deal of interest in computer vision research, including image/video captioning, video retrieval, and video question answering.
Video-language learning can be applied to high-level computer vision tasks and to various future industries such as search engines, social marketing, automated driving, and robotics, for example through question answering and dialogue generation about the surrounding environment.
However, despite these developments, video-language learning suffers from a much higher degree of complexity.
This thesis investigates methodologies for learning the relationship between videos and free-form language, including explanations, conversations, and question-and-answers, so that a machine can easily adapt to target downstream tasks.
First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer in a video attention model, which show that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce an end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using representative visual concept words detected by an attention-based detector. As a follow-up to these methods, we introduce JSFusion (Joint Sequence Fusion), which enables efficient video search and QA through many-to-many matching in the attention model.
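A minimal sketch of the many-to-many matching idea behind JSFusion (an illustrative simplification, not the thesis code): compute a pairwise frame-word similarity tensor and pool it into a single video-sentence relevance score.

import torch
import torch.nn.functional as F

def video_sentence_score(frame_feats, word_feats):
    # frame_feats: (F, D) per-frame embeddings; word_feats: (W, D) per-word embeddings
    frame_feats = F.normalize(frame_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    pairwise = frame_feats @ word_feats.t()          # (F, W) frame-word similarity tensor
    best_frame_per_word = pairwise.max(dim=0).values # each word attends to its best frame
    return best_frame_per_word.mean()                # pooled video-sentence relevance score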
Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to improve the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos.
In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in videos, as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applied to the latest models and bring additional performance improvements.
Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating an unlimited amount of video, audio, and free-form language data from the web.
Abstract (in Korean, translated): Vision-language learning is an important and actively studied field because it can be applied not only to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding, and event detection, but also, through question answering and dialogue generation about the surrounding environment, to internet search and to many future industries such as social marketing, automated driving, and robotics assistance.
Building on this importance, computer vision and natural language processing have each advanced within their own domains, and with the recent advent of deep learning they have progressed remarkably, complementing each other and improving learning results with strong synergy.
Despite these developments, however, video-language learning often remains difficult because the complexity of the problem is considerably higher.
This dissertation aims to learn the relationship between videos and the corresponding free-form language, such as descriptions, dialogues, and question answering, more efficiently, and to improve how well models adapt to target tasks.
First, we introduce several methods for efficiently learning the relationship between long sentences and videos, whose visual complexity exceeds that of images: a method that supervises the video-language attention model with human attention, a semantic attention method that further reduces the complexity of the attention algorithm by using representative visual words detected from the video first, and a Joint Sequence Fusion method that enables efficient video retrieval and question answering based on many-to-many matching in the attention model.
Next, we introduce the Character in Story Identification Network, in which the attention model goes beyond object-word relations to perform person search (character grounding) and person re-identification in videos simultaneously so that the two tasks reinforce each other. Finally, we introduce a method that uses self-supervised learning to guide an attention-based language model to generate coherent descriptions for long videos.
In summary, the novel methods proposed in this dissertation serve as technical stepping stones for video-language tasks such as video captioning, video retrieval, and video question answering; the attention modules learned through video caption training were transplanted into the retrieval, question answering, and person search networks, achieving state-of-the-art performance on these new problems. This shows experimentally that transferring linguistic knowledge obtained from video-language learning greatly benefits multimodal video learning that spans vision and audition. As future work, building on these studies, we plan to build an unsupervised learning model that integrates large-scale language, video, and audio data from the web to solve many open problems in industry.
Chapter 1 Introduction
1.1 Contributions
1.2 Outline of the Thesis
Chapter 2 Related Work
2.1 Video Captioning
2.2 Video Retrieval with Natural Language
2.3 Video Question and Answering
2.4 Cross-modal Representation Learning for Vision and Language Tasks
Chapter 3 Human Attention Transfer for Video Captioning
3.1 Introduction
3.2 Video Datasets for Caption and Gaze
3.3 Approach
3.3.1 Video Pre-processing and Description
3.3.2 The Recurrent Gaze Prediction (RGP) Model
3.3.3 Construction of Visual Feature Pools
3.3.4 The Decoder for Caption Generation
3.3.5 Training
3.4 Experiments
3.4.1 Evaluation of Gaze Prediction
3.4.2 Evaluation of Video Captioning
3.4.3 Human Evaluation via AMT
3.5 Conclusion
Chapter 4 Semantic Word Attention for Video QA and Video Captioning
4.1 Introduction
4.1.1 Related Work
4.1.2 Contributions
4.2 Approach
4.2.1 Preprocessing
4.2.2 An Attention Model for Concept Detection
4.2.3 Video-to-Language Models
4.2.4 A Model for Description
4.2.5 A Model for Fill-in-the-Blank
4.2.6 A Model for Multiple-Choice Test
4.2.7 A Model for Retrieval
4.3 Experiments
4.3.1 The LSMDC Dataset and Tasks
4.3.2 Quantitative Results
4.3.3 Qualitative Results
4.4 Conclusion
Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
5.1 Introduction
5.2 Related Work
5.3 Approach
5.3.1 Preprocessing
5.3.2 The Joint Semantic Tensor
5.3.3 The Convolutional Hierarchical Decoder
5.3.4 An Illustrative Example of How the JSFusion Model Works
5.3.5 Training
5.3.6 Implementation of Video-Language Models
5.4 Experiments
5.4.1 LSMDC Dataset and Tasks
5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks
5.4.3 Quantitative Results
5.4.4 Qualitative Results
5.5 Conclusion
Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
6.1 Introduction
6.2 Related Work
6.3 Approach
6.3.1 Video Preprocessing
6.3.2 Visual Track Embedding
6.3.3 Textual Character Embedding
6.3.4 Character Grounding
6.3.5 Re-Identification
6.3.6 Joint Training
6.4 Experiments
6.4.1 Experimental Setup
6.4.2 Quantitative Results
6.4.3 Qualitative Results
6.5 Conclusion
Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
7.1 Introduction
7.2 Related Work
7.3 Approach
7.3.1 The Visual Encoder
7.3.2 The Language Generator
7.3.3 Adaptation Training
7.3.4 The Sequential Coherence Loss
7.3.5 Training with the Adaptation Loss
7.3.6 Fine-tuning and Inference
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Quantitative Results
7.4.3 Further Analyses
7.4.4 Human Evaluation Results
7.4.5 Qualitative Results
7.5 Conclusion
Chapter 8 Conclusion
8.1 Summary
8.2 Future Work
Bibliography
Abstract (in Korean)
Acknowledgements
A Survey of Natural Language Generation
This paper offers a comprehensive review of the research on Natural Language
Generation (NLG) over the past two decades, especially in relation to
data-to-text generation and text-to-text generation deep learning methods, as
well as new applications of NLG technology. This survey aims to (a) give the
latest synthesis of deep learning research on the NLG core tasks, as well as
the architectures adopted in the field; (b) detail meticulously and
comprehensively various NLG tasks and datasets, and draw attention to the
challenges in NLG evaluation, focusing on different evaluation methods and
their relationships; and (c) highlight some areas of future emphasis and relatively
recent research issues that arise from the increasing synergy between NLG and
other artificial intelligence areas, such as computer vision, text, and
computational creativity.
Comment: Accepted by ACM Computing Surveys (CSUR) 202
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
Visual grounding (VG) aims to establish fine-grained alignment between vision
and language. Ideally, it can be a testbed for vision-and-language models to
evaluate their understanding of the images and texts and their reasoning
abilities over their joint space. However, most existing VG datasets are
constructed using simple description texts, which do not require sufficient
reasoning over the images and texts. This has been demonstrated in a recent
study~\cite{luo2022goes}, where a simple LSTM-based text encoder without
pretraining can achieve state-of-the-art performance on mainstream VG datasets.
Therefore, in this paper, we propose a novel benchmark of \underline{S}cene
\underline{K}nowledge-guided \underline{V}isual \underline{G}rounding (SK-VG),
where the image content and referring expressions are not sufficient to ground
the target objects, forcing the models to reason over the long-form scene
knowledge. To perform this task, we propose two approaches to accept the
triple-type input: the first embeds the knowledge into the image features
before the image-query interaction, while the second leverages linguistic
structure to assist in computing the image-text matching. We conduct extensive
experiments to analyze these methods and show that the proposed approaches
achieve promising results but still leave room for improvement in both
performance and interpretability. The dataset and code are available at
\url{https://github.com/zhjohnchan/SK-VG}.
Comment: Computer Vision and Natural Language Processing. 21 pages, 14 figures. CVPR-202
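As a hedged sketch of the first strategy described above (injecting scene knowledge into the image features before the image-query interaction), under assumed shapes and layers rather than the paper's actual architecture:

import torch
import torch.nn as nn

class KnowledgeInjectedGrounder(nn.Module):
    """Injects scene-knowledge features into image features before image-query interaction."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.know_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.box_head = nn.Linear(dim, 4)            # a normalized box per query token

    def forward(self, img_feats, knowledge_feats, query_feats):
        # img_feats: (B, P, D) image patches; knowledge_feats: (B, K, D) encoded scene knowledge
        # query_feats: (B, Q, D) encoded referring expression
        img_with_know, _ = self.know_to_img(img_feats, knowledge_feats, knowledge_feats)
        grounded, _ = self.query_to_img(query_feats, img_with_know, img_with_know)
        return self.box_head(grounded).sigmoid()     # (B, Q, 4) predicted boxes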
- β¦