252 research outputs found

    Hierarchical Boundary-Aware Neural Encoder for Video Captioning

    The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose a novel LSTM cell which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset and the Microsoft Video Description Corpus. Experiments show that our approach can discover appropriate hierarchical representations of input videos and improve state-of-the-art results on movie description datasets.
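
    A minimal PyTorch sketch of the boundary-gated encoding idea described above (an illustrative reconstruction, not the authors' code): a sigmoid boundary detector decides at each time step whether the temporal connection to the previous hidden state is kept or cut, and the states emitted at boundary points are aggregated into a video representation. The paper feeds the boundary states into a further recurrent layer; the sum below is a stand-in, and all class names and dimensions are hypothetical.

    import torch
    import torch.nn as nn

    class BoundaryAwareEncoder(nn.Module):
        """Recurrent encoder that (softly) resets its state at detected
        segment boundaries, so each segment is encoded largely on its own."""

        def __init__(self, feat_dim, hidden_dim):
            super().__init__()
            self.cell = nn.LSTMCell(feat_dim, hidden_dim)
            # The boundary detector looks at the current frame and the previous state.
            self.boundary = nn.Linear(feat_dim + hidden_dim, 1)

        def forward(self, frames):                      # frames: (T, B, feat_dim)
            T, B, _ = frames.shape
            h = frames.new_zeros(B, self.cell.hidden_size)
            c = frames.new_zeros(B, self.cell.hidden_size)
            segment_states = []
            for t in range(T):
                x = frames[t]
                # Probability that a new segment starts at frame t.
                s = torch.sigmoid(self.boundary(torch.cat([x, h], dim=1)))  # (B, 1)
                # Soft reset: when s is close to 1, the temporal connection is cut.
                h, c = self.cell(x, ((1 - s) * h, (1 - s) * c))
                segment_states.append(h * s)            # emit the state at boundaries
            # Aggregate the per-segment states into a single video code.
            return torch.stack(segment_states).sum(dim=0)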

    A Hierarchical Quasi-Recurrent approach to Video Captioning

    Video captioning has attracted considerable attention thanks to the use of Recurrent Neural Networks, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and exploit the layered structure of the video. Unlike the established encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose to employ Quasi-Recurrent Neural Networks, extending their basic cell with a boundary detector which can recognize discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We assess our approach on a large-scale dataset, the Montreal Video Annotation dataset. Experiments demonstrate that our approach can find suitable levels of representation of the input information while reducing the computational requirements.
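
    By way of comparison with the LSTM variant above, a quasi-recurrent layer computes its gates with temporal convolutions and keeps only a cheap element-wise recurrence, which is where the computational savings come from. The snippet below is a hedged sketch of how a boundary detector could additionally cut that recurrence, in the spirit of the abstract rather than the authors' implementation; the class name, kernel size and gating details are assumptions.

    import torch
    import torch.nn as nn

    class BoundaryQRNNLayer(nn.Module):
        """Quasi-recurrent layer whose element-wise memory is additionally
        cut wherever a boundary detector fires."""

        def __init__(self, feat_dim, hidden_dim, k=2):
            super().__init__()
            # Candidate, forget and output gates come from a causal temporal convolution.
            self.conv = nn.Conv1d(feat_dim, 3 * hidden_dim, kernel_size=k, padding=k - 1)
            self.boundary = nn.Linear(feat_dim, 1)
            self.hidden_dim = hidden_dim

        def forward(self, frames):                       # frames: (B, T, feat_dim)
            B, T, _ = frames.shape
            gates = self.conv(frames.transpose(1, 2))[..., :T]   # crop to causal length
            z, f, o = gates.transpose(1, 2).chunk(3, dim=-1)
            z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
            b = torch.sigmoid(self.boundary(frames))      # (B, T, 1) boundary probability
            c = frames.new_zeros(B, self.hidden_dim)
            outputs = []
            for t in range(T):
                keep = f[:, t] * (1 - b[:, t])            # forget gate, cut at boundaries
                c = keep * c + (1 - keep) * z[:, t]       # fo-pooling style update
                outputs.append(o[:, t] * c)
            return torch.stack(outputs, dim=1)            # (B, T, hidden_dim)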

    Automatic Video Captioning using Deep Neural Network

    Video understanding has become increasingly important as surveillance, social, and informational videos weave themselves into our everyday lives. Video captioning offers a simple way to summarize, index, and search the data. Most video captioning models utilize a video encoder and captioning decoder framework. Hierarchical encoders can abstractly capture clip-level temporal features to represent a video, but the clips are at fixed time steps. This thesis research introduces two models: a hierarchical model with steered captioning, and a Multi-stream Hierarchical Boundary model. The steered captioning model is the first to use visual attributes to guide an attention mechanism to appropriate locations in a video. The Multi-stream Hierarchical Boundary model combines a fixed-hierarchy recurrent architecture with a soft hierarchy layer by using intrinsic feature boundary cuts within a video to define clips. This thesis also introduces a novel parametric Gaussian attention which removes the restriction of soft attention techniques that require fixed-length video streams. By carefully incorporating Gaussian attention in designated layers, the proposed models demonstrate state-of-the-art video captioning results on recent datasets.
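
    The parametric Gaussian attention mentioned above can be read as predicting a centre and a width on the time axis instead of scoring every frame individually, which is what removes the fixed-length requirement. The snippet below is an assumed formulation for illustration only, not the thesis code; normalising the centre to the clip length is one of several reasonable choices.

    import torch
    import torch.nn as nn

    class GaussianAttention(nn.Module):
        """Attention whose weights follow a Gaussian over time, parameterised
        by the decoder state, so any number of frames can be attended."""

        def __init__(self, state_dim):
            super().__init__()
            self.to_params = nn.Linear(state_dim, 2)      # -> (centre, log width)

        def forward(self, features, state):               # features: (B, T, D)
            B, T, _ = features.shape
            mu, log_sigma = self.to_params(state).chunk(2, dim=-1)   # (B, 1) each
            mu = torch.sigmoid(mu) * (T - 1)               # centre placed inside the clip
            sigma = torch.exp(log_sigma) + 1e-3            # strictly positive width
            t = torch.arange(T, device=features.device, dtype=features.dtype)
            w = torch.exp(-0.5 * ((t.unsqueeze(0) - mu) / sigma) ** 2)   # (B, T)
            w = w / w.sum(dim=1, keepdim=True)             # normalise to a distribution
            return (w.unsqueeze(-1) * features).sum(dim=1)  # attended feature, (B, D)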

    Multi-Modal Deep Learning to Understand Vision and Language

    Developing intelligent agents that can perceive and understand the rich visual world around us has been a long-standing goal in the field of artificial intelligence. In the last few years, significant progress has been made towards this goal, and the recent incredible advances in general visual and language understanding have largely been attributed to deep learning. Convolutional neural networks have been used to learn image representations, while recurrent neural networks have demonstrated the ability to generate text from visual stimuli. In this thesis, we develop methods and techniques using hybrid convolutional and recurrent neural network architectures that connect visual data and natural language utterances. The work is divided into two broad parts. First, we introduce a general-purpose attention mechanism modeled using a continuous function for video understanding. The use of an attention-based hierarchical approach along with automatic boundary detection advances state-of-the-art video captioning results. We also develop techniques for summarizing and annotating long videos. In the second part, we introduce architectures along with training techniques to produce a common connection space where natural language sentences are efficiently and accurately connected with visual modalities. In this connection space, similar concepts lie close, while dissimilar concepts lie far apart, irrespective of their modality. We discuss four modality transformations: visual to text, text to visual, visual to visual and text to text. We introduce a novel attention mechanism to align multi-modal embeddings which are learned through a multi-modal metric loss function. The common vector space is shown to enable bidirectional generation of images and text. The learned common vector space is evaluated on multiple image-text datasets for cross-modal retrieval and zero-shot retrieval. The models are shown to advance the state of the art on tasks that require joint processing of images and natural language.
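
    A common way to realise such a joint vision-language space is a bidirectional max-margin loss over a batch of matched image-sentence pairs: each matched pair must outscore every mismatched pair in both retrieval directions. The snippet below sketches that generic formulation; it is not claimed to be the exact multi-modal metric loss of this thesis, and the margin value is illustrative.

    import torch
    import torch.nn.functional as F

    def contrastive_margin_loss(img_emb, txt_emb, margin=0.2):
        """Bidirectional hinge loss on cosine similarity for a joint
        image-text embedding space (diagonal entries are the matched pairs)."""
        img_emb = F.normalize(img_emb, dim=1)
        txt_emb = F.normalize(txt_emb, dim=1)
        scores = img_emb @ txt_emb.t()                       # (B, B) similarity matrix
        pos = scores.diag().unsqueeze(1)                     # matched-pair scores
        cost_txt = (margin + scores - pos).clamp(min=0)      # image vs. wrong sentences
        cost_img = (margin + scores - pos.t()).clamp(min=0)  # sentence vs. wrong images
        mask = torch.eye(scores.size(0), device=scores.device, dtype=torch.bool)
        return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()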

    ์ด์•ผ๊ธฐํ˜• ์„ค๋ช…๋ฌธ์„ ํ™œ์šฉํ•œ ๋Œ€๊ทœ๋ชจ ๋น„๋””์˜ค ํ•™์Šต ์—ฐ๊ตฌ

    Doctoral dissertation, Seoul National University, College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: Gunhee Kim.
    Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a lot of interest in computer vision research, including image/video captioning, video retrieval and video question answering. These tasks can be applied to high-level computer vision problems and to various future industries such as search engines, social marketing, automated driving, and robotics support through QA/dialog generation for the surrounding environment. However, despite these developments, video-language learning suffers from a higher degree of complexity. This thesis investigates methodologies for learning the relationship between videos and free-formed languages, including explanations, conversations, and question-and-answers, so that the machine can easily adapt to target downstream tasks. First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer for the video attention model, which show that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce an end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using representative visual concept words detected by an attention-based detector. As a follow-up study on the previous methods, we introduce JSFusion (Joint Sequence Fusion), which enables efficient video search and QA through many-to-many matching in the attention model. Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to increase the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which encourages caption generation models to generate coherent narratives for long videos. In summary, this thesis presents novel approaches to automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in video, as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applied to the latest models and to bring additional performance improvements. Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating an unlimited amount of video, audio, and free-formed language data from the web.
    Vision-language learning is an important and actively studied area: it applies not only to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding and event detection, but also to question answering and dialogue generation about the surrounding environment, and can therefore support many future industries such as internet search, social marketing, automated driving and robotics. Computer vision and natural language processing have each advanced within their own domains, but with the recent rise of deep learning they have developed rapidly and now complement each other, producing a strong synergy. Despite this progress, video-language learning often remains difficult because the complexity of the problem is a step higher. This thesis aims to learn the relationship between videos and the corresponding free-formed language, such as descriptions, dialogues and question answering, more efficiently, and to adapt well to target tasks. First, we introduce several methods for efficiently learning the relationship between long sentences and videos, whose visual complexity is higher than that of images: a method that supervises the video-language attention model with human attention, a semantic attention method that further reduces the complexity of the attention algorithm by conditioning on representative visual concept words detected in the video, and a Joint Sequence Fusion method that enables efficient video retrieval and question answering through many-to-many matching in the attention model. Next, we introduce the Character in Story Identification Network, in which the attention model goes beyond object-word relations and jointly performs person search (character grounding) and person re-identification in movies, with the two tasks reinforcing each other. Finally, we introduce a method based on self-supervised learning that guides an attention-based language model to generate coherent descriptions of long videos. In summary, the novel methods proposed in this thesis serve as technical stepping stones for video-language tasks such as video captioning, video retrieval and video question answering; the attention modules learned through video captioning were transplanted into the networks for retrieval, question answering and person search, and achieved state-of-the-art performance on these new problems at the same time. This shows experimentally that transferring linguistic knowledge obtained from video-language learning greatly helps audio-visual multimodal video learning. As future work, we plan to build on these studies and design an unsupervised learning model that integrates large-scale language, video and audio data from the web to solve many open problems in industry.
    Contents:
    Chapter 1 Introduction: 1.1 Contributions; 1.2 Outline of the thesis
    Chapter 2 Related Work: 2.1 Video Captioning; 2.2 Video Retrieval with Natural Language; 2.3 Video Question and Answering; 2.4 Cross-modal Representation Learning for Vision and Language Tasks
    Chapter 3 Human Attention Transfer for Video Captioning: 3.1 Introduction; 3.2 Video Datasets for Caption and Gaze; 3.3 Approach (3.3.1 Video Pre-processing and Description; 3.3.2 The Recurrent Gaze Prediction (RGP) Model; 3.3.3 Construction of Visual Feature Pools; 3.3.4 The Decoder for Caption Generation; 3.3.5 Training); 3.4 Experiments (3.4.1 Evaluation of Gaze Prediction; 3.4.2 Evaluation of Video Captioning; 3.4.3 Human Evaluation via AMT); 3.5 Conclusion
    Chapter 4 Semantic Word Attention for Video QA and Video Captioning: 4.1 Introduction (4.1.1 Related Work; 4.1.2 Contributions); 4.2 Approach (4.2.1 Preprocessing; 4.2.2 An Attention Model for Concept Detection; 4.2.3 Video-to-Language Models; 4.2.4 A Model for Description; 4.2.5 A Model for Fill-in-the-Blank; 4.2.6 A Model for Multiple-Choice Test; 4.2.7 A Model for Retrieval); 4.3 Experiments (4.3.1 The LSMDC Dataset and Tasks; 4.3.2 Quantitative Results; 4.3.3 Qualitative Results); 4.4 Conclusion
    Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data: 5.1 Introduction; 5.2 Related Work; 5.3 Approach (5.3.1 Preprocessing; 5.3.2 The Joint Semantic Tensor; 5.3.3 The Convolutional Hierarchical Decoder; 5.3.4 An Illustrative Example of How the JSFusion Model Works; 5.3.5 Training; 5.3.6 Implementation of Video-Language Models); 5.4 Experiments (5.4.1 LSMDC Dataset and Tasks; 5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks; 5.4.3 Quantitative Results; 5.4.4 Qualitative Results); 5.5 Conclusion
    Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding: 6.1 Introduction; 6.2 Related Work; 6.3 Approach (6.3.1 Video Preprocessing; 6.3.2 Visual Track Embedding; 6.3.3 Textual Character Embedding; 6.3.4 Character Grounding; 6.3.5 Re-Identification; 6.3.6 Joint Training); 6.4 Experiments (6.4.1 Experimental Setup; 6.4.2 Quantitative Results; 6.4.3 Qualitative Results); 6.5 Conclusion
    Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling: 7.1 Introduction; 7.2 Related Work; 7.3 Approach (7.3.1 The Visual Encoder; 7.3.2 The Language Generator; 7.3.3 Adaptation Training; 7.3.4 The Sequential Coherence Loss; 7.3.5 Training with the Adaptation Loss; 7.3.6 Fine-tuning and Inference); 7.4 Experiments (7.4.1 Experimental Setup; 7.4.2 Quantitative Results; 7.4.3 Further Analyses; 7.4.4 Human Evaluation Results; 7.4.5 Qualitative Results); 7.5 Conclusion
    Chapter 8 Conclusion: 8.1 Summary; 8.2 Future Works
    Bibliography; Abstract (in Korean); Acknowledgements
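
    The human-gaze supervision described in the dissertation abstract above is typically realised as an auxiliary distribution-matching term on the attention weights, added to the usual captioning objective. The snippet below is a hedged sketch of that idea, not the dissertation's implementation; the function name, the KL formulation and the weighting factor are assumptions.

    import torch.nn.functional as F

    def gaze_supervision_loss(attn_logits, gaze_map, caption_loss, lam=1.0):
        """Push the model's spatial attention toward a human gaze heat map
        with a KL term, keeping the standard captioning loss as the main task."""
        attn = F.log_softmax(attn_logits.flatten(1), dim=1)          # (B, regions) log-probs
        gaze = gaze_map.flatten(1)
        gaze = gaze / gaze.sum(dim=1, keepdim=True).clamp(min=1e-8)  # normalise the heat map
        kl = F.kl_div(attn, gaze, reduction="batchmean")             # KL(gaze || attention)
        return caption_loss + lam * kl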

    Deep Learning for Semantic Video Understanding

    The field of computer vision has long strived to extract understanding from images and video sequences. The recent flood of video data, along with massive increases in computing power, has provided the perfect environment for advanced research into extracting intelligence from video data. Video data is ubiquitous, occurring in numerous everyday activities such as surveillance, traffic, movies, sports, etc. This massive amount of video needs to be analyzed and processed efficiently to extract semantic features towards video understanding. Such capabilities could benefit surveillance, video analytics and visually challenged people. While watching a long video, humans have the uncanny ability to bypass unnecessary information and concentrate on the important events. These key events can be used as a higher-level description or summary of a long video. Inspired by the human visual cortex, this research affords such abilities in computers using neural networks. Useful or interesting events are first extracted from a video, and deep learning methods are then used to produce natural language summaries for each video sequence. Previous approaches to video description have either been domain specific or have used a template-based approach, filling detected objects, verbs or actions into slots to constitute a grammatically correct sentence. This work exploits temporal contextual information for sentence generation while working on wide-domain datasets. Current state-of-the-art video description methodologies are well suited to short video clips, whereas this research can also be applied to long video sequences. This work proposes methods to generate visual summaries of long videos, and in addition proposes techniques to annotate and generate textual summaries of the videos using recurrent networks. End-to-end video summarization depends heavily on abstractive summarization of video descriptions. State-of-the-art joint neural language and attention models have been used to generate textual summaries. Interesting segments of long videos are extracted based on image quality as well as cinematographic and consumer preferences. This novel approach will be a stepping stone for a variety of innovative applications such as video retrieval, automatic summarization for visually impaired persons, automatic movie review generation, and video question answering systems.

    Deep Architectures for Visual Recognition and Description

    In recent times, digital media content is inherently multimedia in nature, consisting of text, audio, images and video. Several outstanding Computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the fields of Automatic Image Annotation (AIA), Image Captioning and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and complex problem altogether. This study compares various existing video captioning approaches and attempts their classification and analysis based on different parameters, viz., the type of captioning method (generation/retrieval), the type of learning models employed, the desired length of the generated description, etc. This dissertation also critically analyzes the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the quality of the resulting video descriptions. A detailed study of important existing models, highlighting their comparative advantages as well as disadvantages, is also included. In this study, a novel approach to video captioning on the Microsoft Video Description (MSVD) and Microsoft Video-to-Text (MSR-VTT) datasets is proposed, using supervised learning techniques to train a deep combinational framework that achieves better-quality video captioning via predicting semantic tags. We develop simple shallow CNNs (2D and 3D) as feature extractors, Deep Neural Networks (DNNs) and Bidirectional LSTMs (BiLSTMs) as tag prediction models, and a Recurrent Neural Network (LSTM) as the language model. The aim of the work was to provide an alternative route to generating captions from videos via semantic tag prediction, and to deploy simpler, shallower deep architectures with lower memory requirements, so that the developed models remain stable and viable options as the scale of the data increases. This study also successfully employed deep architectures like the Convolutional Neural Network (CNN) for speeding up the automation of hand gesture recognition and classification for the sign language of the Indian classical dance form 'Bharatnatyam'. This hand gesture classification is primarily aimed at 1) building a novel dataset of 2D single-hand gestures belonging to 27 classes, collected from (i) the Google search engine (Google Images), (ii) YouTube videos (dynamic and with background considered) and (iii) professional artists under staged environment constraints (plain backgrounds); 2) exploring the effectiveness of CNNs for identifying and classifying the single-hand gestures by optimizing the hyperparameters; and 3) evaluating the impact of transfer learning and double transfer learning, a novel concept explored for achieving higher classification accuracy.
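
    The tag-mediated pipeline described above (pre-extracted 2D/3D CNN features, a BiLSTM tag predictor, and an LSTM language model conditioned on the predicted semantic tags) could be wired together roughly as follows. This is an illustrative sketch with made-up dimensions and names, not the authors' released model; in particular, the way the tag vector is injected into the decoder's initial state is an assumption.

    import torch
    import torch.nn as nn

    class TagConditionedCaptioner(nn.Module):
        """Turn pre-extracted CNN clip features into semantic-tag scores with a
        BiLSTM, then decode a caption conditioned on the mean visual feature and
        the predicted tag vector."""

        def __init__(self, feat_dim=2048, n_tags=300, vocab=10000, hidden=512):
            super().__init__()
            self.tagger = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.tag_head = nn.Linear(2 * hidden, n_tags)
            self.embed = nn.Embedding(vocab, hidden)
            self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
            self.init_state = nn.Linear(feat_dim + n_tags, 2 * hidden)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, clip_feats, captions):           # (B, T, feat_dim), (B, L) word ids
            enc, _ = self.tagger(clip_feats)
            tags = torch.sigmoid(self.tag_head(enc.mean(dim=1)))      # (B, n_tags) tag scores
            ctx = torch.cat([clip_feats.mean(dim=1), tags], dim=1)    # visual + tag context
            h0, c0 = self.init_state(ctx).tanh().chunk(2, dim=1)
            state = (h0.unsqueeze(0).contiguous(), c0.unsqueeze(0).contiguous())
            dec, _ = self.decoder(self.embed(captions), state)
            return self.out(dec), tags                      # word logits, tag predictions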
    • โ€ฆ
    corecore