Sequence to Sequence -- Video to Text
Real-world videos often have complex dynamics; methods for generating
open-domain video descriptions should be sensitive to temporal structure and
allow both input (sequence of frames) and output (sequence of words) of
variable length. To approach this problem, we propose a novel end-to-end
sequence-to-sequence model to generate captions for videos. For this we exploit
recurrent neural networks, specifically LSTMs, which have demonstrated
state-of-the-art performance in image caption generation. Our LSTM model is
trained on video-sentence pairs and learns to associate a sequence of video
frames to a sequence of words in order to generate a description of the event
in the video clip. Our model naturally is able to learn the temporal structure
of the sequence of frames as well as the sequence model of the generated
sentences, i.e. a language model. We evaluate several variants of our model
that exploit different visual features on a standard set of YouTube videos and
two movie description datasets (M-VAD and MPII-MD).
Comment: ICCV 2015 camera-ready. Includes code, project page and LSMDC challenge results.
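The encode-then-decode control flow the abstract describes can be sketched in miniature. Everything below is a toy stand-in: `toy_step` replaces a trained LSTM cell and `make_decoder` replaces the softmax word predictor; only the two-stage structure (read all frames first, then emit words until `<eos>`) mirrors the S2VT idea, and both input and output lengths are variable.

```python
# Toy sketch of S2VT-style two-stage unrolling (illustrative only): a single
# recurrent cell first "encodes" a variable-length frame sequence, then
# "decodes" a variable-length word sequence. toy_step and the decoder are
# hypothetical stand-ins for a trained LSTM and its word softmax.

def toy_step(state, x):
    """Stand-in for one LSTM step: mixes the input into the hidden state."""
    return [(s + v) * 0.5 for s, v in zip(state, x)]

def s2vt_caption(frame_features, step, decode_word, max_len=10):
    state = [0.0] * len(frame_features[0])
    # Encoding stage: read every frame, emit nothing.
    for feat in frame_features:
        state = step(state, feat)
    # Decoding stage: emit words until <eos> or max_len.
    words = []
    for _ in range(max_len):
        word = decode_word(state)
        if word == "<eos>":
            break
        words.append(word)
        state = step(state, [0.0] * len(state))  # padded input while decoding
    return words

# Toy decoder: yields a fixed phrase, then the end-of-sentence token.
PHRASE = ["a", "man", "plays", "guitar", "<eos>"]
def make_decoder():
    it = iter(PHRASE)
    return lambda state: next(it)

frames = [[0.2, 0.8], [0.5, 0.1], [0.9, 0.4]]  # 3 frames, 2-d features
print(s2vt_caption(frames, toy_step, make_decoder()))
# → ['a', 'man', 'plays', 'guitar']
```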
Tracking by Prediction: A Deep Generative Model for Multi-Person Localisation and Tracking
Current multi-person localisation and tracking systems over-rely on
appearance models for target re-identification, and almost no
approaches employ a complete deep learning solution for both objectives. We
present a novel, complete deep learning framework for multi-person localisation
and tracking. In this context we first introduce a lightweight sequential
Generative Adversarial Network architecture for person localisation, which
overcomes issues related to occlusions and noisy detections typically found in
a multi-person environment. In the proposed tracking framework we build upon
recent advances in pedestrian trajectory prediction approaches and propose a
novel data association scheme based on predicted trajectories. This removes the
need for computationally expensive person re-identification systems based on
appearance features and generates human-like trajectories with minimal
fragmentation. The proposed method is evaluated on multiple public benchmarks,
including both static and dynamic cameras, and achieves outstanding
performance, especially among other recently proposed deep neural network
based approaches.
Comment: To appear in IEEE Winter Conference on Applications of Computer Vision (WACV), 201
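The trajectory-based data association idea can be illustrated with a deliberately simple stand-in: a constant-velocity predictor in place of the paper's learned trajectory model, and greedy nearest-neighbour matching in place of its association scheme. The point is only that tracks and detections can be linked from predicted positions alone, with no appearance features.

```python
import math

# Toy sketch of association-by-prediction (not the paper's model): each track
# predicts its next position with a constant-velocity motion model, and each
# detection is greedily matched to the nearest prediction.

def predict(track):
    """Constant-velocity prediction from the last two positions."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def associate(tracks, detections):
    """Greedily match each track's predicted position to the closest detection."""
    preds = {tid: predict(tr) for tid, tr in tracks.items()}
    free = list(range(len(detections)))
    matches = {}
    for tid, (px, py) in preds.items():
        if not free:
            break
        j = min(free, key=lambda k: math.hypot(detections[k][0] - px,
                                               detections[k][1] - py))
        matches[tid] = j
        free.remove(j)
    return matches

tracks = {"A": [(0, 0), (1, 0)], "B": [(5, 5), (5, 6)]}  # two moving people
detections = [(5, 7), (2, 0)]                            # unordered detections
print(associate(tracks, detections))  # → {'A': 1, 'B': 0}
```

A production tracker would replace the greedy loop with globally optimal assignment (e.g. the Hungarian algorithm) and gate matches by a distance threshold, but the data flow is the same.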
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset containing transcribed ADs that are temporally
aligned to full-length movies. In addition, we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)" at
ICCV 2015.
Deep Architectures for Visual Recognition and Description
Digital media content today is inherently multimedia, consisting of text, audio, image and video. Several outstanding Computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the fields of Automatic Image Annotation (AIA), Image Captioning and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and complex problem altogether. This study compares various existing video captioning approaches and attempts their classification and analysis based on different parameters, viz., type of captioning method (generation/retrieval), type of learning model employed, the desired length of the generated output description, etc. This dissertation also critically analyzes the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the final quality of the generated video descriptions. A detailed study of important existing models, highlighting their comparative advantages as well as disadvantages, is also included.
In this study, a novel approach to video captioning on the Microsoft Video Description (MSVD) and Microsoft Video-to-Text (MSR-VTT) datasets is proposed, using supervised learning techniques to train a deep combinational framework that achieves better-quality video captioning via predicting semantic tags. We develop simple shallow CNNs (2D and 3D) as feature extractors, Deep Neural Networks (DNNs) and Bidirectional LSTMs (BiLSTMs) as tag prediction models, and a Recurrent Neural Network (RNN/LSTM) as the language model. The aim of the work was to provide an alternative route to generating captions from videos via semantic tag prediction, and to deploy simpler, shallower deep architectures with lower memory requirements, so that the solution is not memory-intensive and the developed models remain stable and viable options when the scale of the data is increased.
This study also successfully employed deep architectures such as the Convolutional Neural Network (CNN) for speeding up the automation of hand gesture recognition and classification of the sign language of the Indian classical dance form "Bharatnatyam". This hand gesture classification work is primarily aimed at 1) building a novel dataset of 2D single-hand gestures belonging to 27 classes, collected from (i) the Google search engine (Google Images), (ii) YouTube videos (dynamic, with background considered) and (iii) professional artists under staged environment constraints (plain backgrounds); 2) exploring the effectiveness of CNNs for identifying and classifying the single-hand gestures by optimizing the hyperparameters; and 3) evaluating the impact of transfer learning and double transfer learning, a novel concept explored for achieving higher classification accuracy.
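The tag-mediated captioning pipeline described above can be sketched as three composed stages. The stubs below are hypothetical stand-ins for the shallow CNN feature extractor, the DNN/BiLSTM tag predictor, and the LSTM language model; none of them are the dissertation's actual models, only the feature → tags → caption data flow is the same.

```python
# Hypothetical sketch of captioning via semantic tag prediction: extract
# features, fire tags from the features, then realize the tags as a sentence.

def extract_features(frames):
    """Stub feature extractor: average each frame's values into one scalar."""
    return [sum(f) / len(f) for f in frames]

def predict_tags(features, tag_bank):
    """Stub tag predictor: fire every tag whose threshold the mean clears."""
    mean = sum(features) / len(features)
    return [tag for threshold, tag in tag_bank if mean >= threshold]

def generate_caption(tags):
    """Stub language model: realize the predicted semantic tags as a caption."""
    return ("a video of " + " and ".join(tags)) if tags else "a video"

def caption_video(frames, tag_bank):
    return generate_caption(predict_tags(extract_features(frames), tag_bank))

TAG_BANK = [(0.3, "a person"), (0.6, "dancing"), (0.9, "on stage")]
frames = [[0.9, 0.7], [0.8, 0.6]]          # 2 frames of toy 2-d "pixels"
print(caption_video(frames, TAG_BANK))     # → a video of a person and dancing
```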
A Study on Large-Scale Video Learning Using Narrative Descriptions
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2021. 2. Gunhee Kim.
Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a lot of interest in computer vision research, including image/video captioning, video retrieval and video question answering.
Video-language learning can be applied to high-level computer vision tasks and various future industries such as search engines, social marketing, automated driving, and robotics, for example through QA/dialog generation about the surrounding environment.
However, despite these developments, video-language learning suffers from a high degree of complexity.
This thesis investigates methodologies for learning the relationship between videos and free-formed language, including explanations, conversations, and question-and-answer pairs, so that the machine can easily adapt to target downstream tasks.
First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer for the video attention model, which show that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce an end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using representative visual concept words detected by an attention-based detector. As a follow-up to these methods, we introduce JSFusion (Joint Sequence Fusion), which enables efficient video search and QA through many-to-many matching in the attention model.
Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to increase the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos.
In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge for objects and motion in video, as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applicable to the latest models, bringing additional performance improvements.
Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating an unlimited amount of video, audio, and free-formed language data from the web.
Chapter 1
Introduction
1.1 Contributions
1.2 Outline of the thesis
Chapter 2
Related Work
2.1 Video Captioning
2.2 Video Retrieval with Natural Language
2.3 Video Question and Answering
2.4 Cross-modal Representation Learning for Vision and Language Tasks
Chapter 3 Human Attention Transfer for Video Captioning
3.1 Introduction
3.2 Video Datasets for Caption and Gaze
3.3 Approach
3.3.1 Video Pre-processing and Description
3.3.2 The Recurrent Gaze Prediction (RGP) Model
3.3.3 Construction of Visual Feature Pools
3.3.4 The Decoder for Caption Generation
3.3.5 Training
3.4 Experiments
3.4.1 Evaluation of Gaze Prediction
3.4.2 Evaluation of Video Captioning
3.4.3 Human Evaluation via AMT
3.5 Conclusion
Chapter 4 Semantic Word Attention for Video QA and Video Captioning
4.1 Introduction
4.1.1 Related Work
4.1.2 Contributions
4.2 Approach
4.2.1 Preprocessing
4.2.2 An Attention Model for Concept Detection
4.2.3 Video-to-Language Models
4.2.4 A Model for Description
4.2.5 A Model for Fill-in-the-Blank
4.2.6 A Model for Multiple-Choice Test
4.2.7 A Model for Retrieval
4.3 Experiments
4.3.1 The LSMDC Dataset and Tasks
4.3.2 Quantitative Results
4.3.3 Qualitative Results
4.4 Conclusion
Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
5.1 Introduction
5.2 Related Work
5.3 Approach
5.3.1 Preprocessing
5.3.2 The Joint Semantic Tensor
5.3.3 The Convolutional Hierarchical Decoder
5.3.4 An Illustrative Example of How the JSFusion Model Works
5.3.5 Training
5.3.6 Implementation of Video-Language Models
5.4 Experiments
5.4.1 LSMDC Dataset and Tasks
5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks
5.4.3 Quantitative Results
5.4.4 Qualitative Results
5.5 Conclusion
Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
6.1 Introduction
6.2 Related Work
6.3 Approach
6.3.1 Video Preprocessing
6.3.2 Visual Track Embedding
6.3.3 Textual Character Embedding
6.3.4 Character Grounding
6.3.5 Re-Identification
6.3.6 Joint Training
6.4 Experiments
6.4.1 Experimental Setup
6.4.2 Quantitative Results
6.4.3 Qualitative Results
6.5 Conclusion
Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
7.1 Introduction
7.2 Related Work
7.3 Approach
7.3.1 The Visual Encoder
7.3.2 The Language Generator
7.3.3 Adaptation Training
7.3.4 The Sequential Coherence Loss
7.3.5 Training with the Adaptation Loss
7.3.6 Fine-tuning and Inference
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Quantitative Results
7.4.3 Further Analyses
7.4.4 Human Evaluation Results
7.4.5 Qualitative Results
7.5 Conclusion
Chapter 8 Conclusion
8.1 Summary
8.2 Future Works
Bibliography
Abstract (Korean)
Acknowledgements
Deep Architectures for Content Moderation and Movie Content Rating
Rating a video based on its content is an important step for classifying
video age categories. Movie content rating and TV show rating are the two most
common rating systems, established by professional committees. However, manually
reviewing and evaluating scene/film content by a committee is tedious work,
and it becomes increasingly difficult with the ever-growing amount of online
video content. As such, a desirable solution is to use computer vision-based
video content analysis techniques to automate the evaluation process. In this
paper, related works are summarized for action recognition, multi-modal
learning, movie genre classification, and sensitive content detection in the
context of content moderation and movie content rating. The project page is
available at https://github.com/fcakyon/content-moderation-deep-learning