72 research outputs found
Learning Video Object Segmentation with Visual Memory
This paper addresses the task of segmenting moving objects in unconstrained
videos. We introduce a novel two-stream neural network with an explicit memory
module to achieve this. The two streams of the network encode spatial and
temporal features in a video sequence respectively, while the memory module
captures the evolution of objects over time. The module to build a "visual
memory" in video, i.e., a joint representation of all the video frames, is
realized with a convolutional recurrent unit learned from a small number of
training video sequences. Given a video frame as input, our approach assigns
each pixel an object or background label based on the learned spatio-temporal
features as well as the "visual memory" specific to the video, acquired
automatically without any manually-annotated frames. The visual memory is
implemented with convolutional gated recurrent units, which allow spatial
information to be propagated over time. We evaluate our method extensively on two
benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show
state-of-the-art results. For example, our approach outperforms the top method
on the DAVIS dataset by nearly 6%. We also provide an extensive ablative
analysis to investigate the influence of each component in the proposed
framework.
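The abstract states that the visual memory is implemented with convolutional gated recurrent units so that spatial information can be propagated over time. A minimal PyTorch sketch of such a ConvGRU cell (names, sizes, and the single-cell structure are illustrative assumptions, not the authors' released code) could look like:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """A convolutional GRU cell: gates are computed with 2D convolutions,
    so the hidden state keeps its spatial layout and can carry per-pixel
    information from frame to frame (a simple 'visual memory')."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # update (z) and reset (r) gates computed jointly
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               2 * hidden_channels, kernel_size, padding=padding)
        # candidate hidden state
        self.cand = nn.Conv2d(in_channels + hidden_channels,
                              hidden_channels, kernel_size, padding=padding)

    def forward(self, x, h):
        # x: (B, C_in, H, W) features of the current frame
        # h: (B, C_hid, H, W) memory accumulated over previous frames
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde
```

Feeding the concatenated appearance and motion features of each frame through such a cell, and decoding the hidden state with a 1x1 convolution into a per-pixel foreground/background score, matches the general structure the abstract describes.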
A Study on Large-Scale Video Learning Using Narrative Descriptions
Doctoral thesis (Ph.D.), Seoul National University Graduate School, Department of Computer Science and Engineering, College of Engineering, February 2021. Advisor: Gunhee Kim.
Extensive contributions are being made to develop intelligent agents that can recognize the world and communicate with it. In this context, various video-language tasks have drawn a great deal of interest in computer vision research, including image/video captioning, video retrieval, and video question answering.
Video-language learning can be applied to high-level computer vision tasks and to various future industries such as search engines, social marketing, autonomous driving, and robotics, by supporting question answering and dialogue generation about the surrounding environment.
However, despite this progress, video-language learning remains challenging because of its higher degree of complexity.
This thesis investigates methodologies for learning the relationship between videos and free-form language, including descriptions, dialogues, and question-answer pairs, so that a machine can easily adapt to target downstream tasks.
First, we introduce several methods for efficiently learning the relationship between long sentences and videos. We present an approach for supervising the video attention model with human attention transfer, which shows that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce an end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using representative visual concept words detected by an attention-based detector. As a follow-up to these methods, we introduce JSFusion (Joint Sequence Fusion), which enables efficient video retrieval and QA through many-to-many matching in the attention model.
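As a rough illustration of the many-to-many matching idea behind JSFusion, the following simplified sketch (assumed shapes and names; the thesis model instead learns a hierarchical convolutional decoder over a joint semantic tensor) scores a video-sentence pair from dense frame-word affinities:

```python
import torch
import torch.nn.functional as F

def video_sentence_score(frame_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (T, D) per-frame embeddings; word_feats: (N, D) per-word embeddings.
    Returns a scalar relevance score from dense frame-word matching."""
    frame_feats = F.normalize(frame_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    affinity = frame_feats @ word_feats.t()      # (T, N) affinity of every frame with every word
    attn = F.softmax(affinity, dim=1)            # soft attention over words for each frame
    per_frame = (attn * affinity).sum(dim=1)     # attended affinity per frame
    return per_frame.mean()                      # pool over frames into one score
```

Ranking candidate videos (retrieval) or candidate answer sentences (multiple-choice QA) by such a score illustrates how one many-to-many matching model can serve both tasks.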
Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to improve character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos.
In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in video, as well as the advantages of multimodal audio-visual learning for understanding videos. Since the proposed methods adapt easily to other video-language tasks, they are expected to be applied to the latest models and to bring additional performance improvements.
Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating unlimited amounts of video, audio, and free-form language data from the web.
Visual-language learning is an important and actively studied field: it applies not only to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding, and event detection, but also, through question answering and dialogue generation about the surrounding environment, to internet search and to emerging industries such as social marketing, autonomous driving, and robotics.
Building on this importance, computer vision and natural language processing have each advanced in their own domains, and with the recent advent of deep learning they have progressed remarkably, complementing each other and improving learning results with strong synergy.
Despite this progress, however, video-language learning often remains difficult because the problem is considerably more complex.
This thesis aims to learn the relationship between videos and the corresponding free-form language (descriptions, dialogues, question answering, and beyond) more efficiently, and to improve models so that they respond well to their target tasks.
First, we introduce several methods for efficiently learning the relationship between long sentences and videos, whose visual complexity is higher than that of images. We present a method that supervises the attention of a video-language model with human attention, followed by a semantic attention method that further reduces the complexity of the attention algorithm by using representative visual words first detected in the video, and a Joint Sequence Fusion method that enables efficient video retrieval and question answering through many-to-many matching in the attention model.
Next, we introduce the Character in Story Identification Network, in which the attention model goes beyond object-word relations to perform person search (character grounding) and person re-identification in video simultaneously, with the two tasks reinforcing each other; finally, we introduce a method that, through self-supervised learning, guides an attention-based language model to generate coherent descriptions for long videos.
In summary, the new methods proposed in this doctoral thesis serve as technical stepping stones for video-language tasks such as video captioning, video retrieval, and video question answering, and the attention module learned through video captioning, when transplanted into the retrieval, question answering, and person search networks, achieved state-of-the-art performance on these new problems simultaneously. This shows experimentally that transferring linguistic knowledge obtained from video-language learning greatly helps multimodal video learning across vision and audio. As future work, building on these studies, we aim to build an unsupervised learning model that integrates the large-scale language, video, and audio data available on the web and can solve many open challenges in industry.
Chapter 1 Introduction
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2 Related Work
2.1 Video Captioning . . . . . . . . . . . . . . . . . . . . . . . . . . .9
2.2 Video Retrieval with Natural Language . . . . . . . . . . . . . . 12
2.3 Video Question and Answering . . . . . . . . . . . . . . . . . . . 13
2.4 Cross-modal Representation Learning for Vision and Language Tasks . . . . 15
Chapter 3 Human Attention Transfer for Video Captioning . . . 18
3.1 Introduction
3.2 Video Datasets for Caption and Gaze . . . . . . . . . . . . . . . 21
3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Video Pre-processing and Description . . . . . . . . . . . 22
3.3.2 The Recurrent Gaze Prediction (RGP) Model . . . . . . . 23
3.3.3 Construction of Visual Feature Pools . . . . . . . . . . . . 24
3.3.4 The Decoder for Caption Generation . . . . . . . . . . . . 26
3.3.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 Evaluation of Gaze Prediction . . . . . . . . . . . . . . . . 29
3.4.2 Evaluation of Video Captioning . . . . . . . . . . . . . . . 32
3.4.3 Human Evaluation via AMT . . . . . . . . . . . . . . . . 35
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Chapter 4 Semantic Word Attention for Video QA and Video Captioning
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.2 An Attention Model for Concept Detection . . . . . . . . 42
4.2.3 Video-to-Language Models . . . . . . . . . . . . . . . . . 45
4.2.4 A Model for Description . . . . . . . . . . . . . . . . . . . 45
4.2.5 A Model for Fill-in-the-Blank . . . . . . . . . . . . . . . . 48
4.2.6 A Model for Multiple-Choice Test . . . . . . . . . . . . . 50
4.2.7 A Model for Retrieval . . . . . . . . . . . . . . . . . . . . 51
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.1 The LSMDC Dataset and Tasks . . . . . . . . . . . . . . 52
4.3.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . 54
4.3.3 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3.2 The Joint Semantic Tensor . . . . . . . . . . . . . . . . . 65
5.3.3 The Convolutional Hierarchical Decoder . . . . . . . . . . 66
5.3.4 An Illustrative Example of How the JSFusion Model Works . . . 68
5.3.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3.6 Implementation of Video-Language Models . . . . . . . . 69
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.1 LSMDC Dataset and Tasks . . . . . . . . . . . . . . . . . 71
5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks . . . . . . . . . 73
5.4.3 Quantitative Results . . . . . . . . . . . . . . . . . . . . . 74
5.4.4 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.1 Video Preprocessing . . . . . . . . . . . . . . . . . . . . . 84
6.3.2 Visual Track Embedding . . . . . . . . . . . . . . . . . . . 85
6.3.3 Textual Character Embedding . . . . . . . . . . . . . . . 86
6.3.4 Character Grounding . . . . . . . . . . . . . . . . . . . . 87
6.3.5 Re-Identification . . . . . . . . . . . . . . . . . . . . . . . 88
6.3.6 Joint Training . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 92
6.4.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . 93
6.4.3 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 95
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.3.1 The Visual Encoder . . . . . . . . . . . . . . . . . . . . . 104
7.3.2 The Language Generator . . . . . . . . . . . . . . . . . . 104
7.3.3 Adaptation Training . . . . . . . . . . . . . . . . . . . . . 105
7.3.4 The Sequential Coherence Loss . . . . . . . . . . . . . . . 105
7.3.5 Training with the Adaptation Loss . . . . . . . . . . . . . 107
7.3.6 Fine-tuning and Inference . . . . . . . . . . . . . . . . . . 107
7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 109
7.4.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . 112
7.4.3 Further Analyses . . . . . . . . . . . . . . . . . . . . . . . 112
7.4.4 Human Evaluation Results . . . . . . . . . . . . . . . . . 115
7.4.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 116
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 8 Conclusion
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Bibliography ... 123
요약 (Abstract in Korean) ... 148
Acknowledgements ... 150
Salient Object Detection for Images Taken by People With Vision Impairments
Salient object detection is the task of producing a binary mask for an image
that deciphers which pixels belong to the foreground object versus background.
We introduce a new salient object detection dataset using images taken by
people who are visually impaired who were seeking to better understand their
surroundings, which we call VizWiz-SalientObject. Compared to seven existing
datasets, VizWiz-SalientObject is the largest (i.e., 32,000 human-annotated
images) and contains unique characteristics including a higher prevalence of
text in the salient objects (i.e., in 68% of images) and salient objects that
occupy a larger portion of the images (i.e., on average, 50% coverage). We
benchmarked seven modern salient object detection methods on our dataset and
found they struggle most with images featuring salient objects that are large,
have less complex boundaries, and lack text, as well as with lower-quality
images. We invite the broader community to work on our new dataset challenge by
publicly sharing the dataset at
https://vizwiz.org/tasks-and-datasets/salient-object .
Comment: Computer Vision and Pattern Recognition
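Predicted binary masks for this task are scored against the human-annotated masks; a small sketch of a generic intersection-over-union check (an illustrative metric only; the benchmark may additionally use measures such as F-measure or MAE) is:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (H, W) binary masks; returns foreground intersection-over-union."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum()) / float(union)
```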
Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability
Video segmentation encompasses a wide range of categories of problem
formulation, e.g., object, scene, actor-action and multimodal video
segmentation, for delineating task-specific scene components with pixel-level
masks. Recently, approaches in this research area have shifted from
ConvNet-based to transformer-based models. In addition, various
interpretability approaches have appeared for transformer models and video
temporal dynamics, motivated by the growing interest in basic scientific
understanding, model diagnostics and societal implications of real-world
deployment. Previous surveys mainly focused on ConvNet models on a subset of
video segmentation tasks or transformers for classification tasks. Moreover,
component-wise discussion of transformer-based video segmentation models has
not yet received due focus. In addition, previous reviews of interpretability
methods focused on transformers for classification, while analysis of video
temporal dynamics modelling capabilities of video models received less
attention. In this survey, we address the above with a thorough discussion of
various categories of video segmentation, a component-wise discussion of the
state-of-the-art transformer-based models, and a review of related
interpretability methods. We first present an introduction to the different
video segmentation task categories, their objectives, specific challenges and
benchmark datasets. Next, we provide a component-wise review of recent
transformer-based models and document the state of the art on different video
segmentation tasks. Subsequently, we discuss post-hoc and ante-hoc
interpretability methods for transformer models and interpretability methods
for understanding the role of the temporal dimension in video models. Finally,
we conclude our discussion with future research directions.
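Among the post-hoc interpretability methods for transformers that such surveys cover, one common technique is attention rollout, which composes head-averaged attention matrices across layers (with the residual connection folded in) to estimate how much each input token, e.g., a video patch, contributes to the output. A minimal sketch, not tied to any specific model discussed in the survey:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of (num_tokens, num_tokens) row-stochastic attention matrices,
    one per layer, already averaged over heads. Returns accumulated attributions."""
    num_tokens = attentions[0].shape[0]
    rollout = np.eye(num_tokens)
    for attn in attentions:
        attn = attn + np.eye(num_tokens)                 # fold in the residual connection
        attn = attn / attn.sum(axis=-1, keepdims=True)   # renormalize rows
        rollout = attn @ rollout                         # compose with earlier layers
    return rollout
```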
- β¦