
    DiffusionVMR: Diffusion Model for Video Moment Retrieval

    Video moment retrieval is a fundamental vision-language task that aims to retrieve target moments from an untrimmed video based on a language query. Existing methods typically generate numerous proposals in advance, either manually or via generative networks, as the support set for retrieval, which is both inflexible and time-consuming. Inspired by the success of diffusion models in object detection, this work reformulates video moment retrieval as a denoising generation process in order to dispense with inflexible and time-consuming proposal generation. To this end, we propose a novel proposal-free framework, DiffusionVMR, which directly samples random spans from noise as candidates and introduces denoising learning to ground target moments. During training, Gaussian noise is added to the ground-truth moments, and the model learns to reverse this process. During inference, a set of time spans is progressively refined from the initial noise to the final output. Notably, training and inference in DiffusionVMR are decoupled, so an arbitrary number of random spans can be used at inference without matching the training phase. Extensive experiments on three widely used benchmarks (QVHighlights, Charades-STA, and TACoS) demonstrate the effectiveness of DiffusionVMR in comparison with state-of-the-art methods.
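
    The core idea in this abstract — corrupting ground-truth spans with Gaussian noise during training and iteratively denoising random spans at inference — can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' code: the `SpanDenoiser` module, the noise schedule, and the coarse refinement loop are all assumptions.

```python
import torch
import torch.nn as nn

T = 1000                                       # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative noise schedule

class SpanDenoiser(nn.Module):
    """Toy denoiser: predicts clean (center, width) spans from noisy spans,
    conditioned on pooled video/query features and the timestep."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + feat_dim + 1, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, noisy_spans, feats, t):
        # noisy_spans: (N, 2), feats: (N, feat_dim), t: (N, 1) normalized step
        return self.net(torch.cat([noisy_spans, feats, t], dim=-1))

def q_sample(spans, t, noise):
    """Forward diffusion: corrupt ground-truth spans with Gaussian noise."""
    a = alpha_bar[t].unsqueeze(-1)
    return a.sqrt() * spans + (1 - a).sqrt() * noise

# --- training step (sketch) ---
model = SpanDenoiser()
spans = torch.rand(8, 2)             # ground-truth (center, width) in [0, 1]
feats = torch.randn(8, 256)          # stand-in for fused video/query features
t = torch.randint(0, T, (8,))
noise = torch.randn_like(spans)
pred = model(q_sample(spans, t, noise), feats, t.float().unsqueeze(-1) / T)
loss = nn.functional.l1_loss(pred, spans)
loss.backward()

# --- inference (sketch): refine an arbitrary number of random spans ---
with torch.no_grad():
    x = torch.randn(32, 2)           # more candidates than used in training
    for step in reversed(range(0, T, 100)):          # coarse refinement schedule
        t_b = torch.full((32, 1), step / T)
        x0 = model(x, feats[:1].expand(32, -1), t_b)  # predicted clean spans
        t_prev = torch.full((32,), max(step - 100, 0), dtype=torch.long)
        x = q_sample(x0, t_prev, torch.randn_like(x))  # re-noise to the previous step
```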

    Open-Vocabulary Object Detection via Scene Graph Discovery

    In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which only recognizes fixed-category objects, OV detection aims to detect objects in an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they use only pairs of nouns and individual objects in the VL data, while these data usually contain much richer information, such as scene graphs, which is also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. First, a scene-graph-based decoder (SGDecoder), including sparse scene-graph-guided attention (SSGA), is presented; it captures scene graphs and leverages them to discover OV objects. Second, we propose scene-graph-based prediction (SGPred), in which a scene-graph-based offset regression (SGOR) mechanism enables mutual enhancement between scene graph extraction and object localization. Third, we design a cross-modal learning mechanism in SGPred that takes scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show that our model can perform OV scene graph detection, a task that previous OV scene graph generation methods cannot tackle.
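
    The idea of scene-graph-guided attention can be illustrated generically with a masked attention layer in which object queries attend only to queries linked by a predicted relation. This is a sketch of the general mechanism, not the paper's SGDecoder; the adjacency construction and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def scene_graph_guided_attention(queries, adjacency):
    """Masked self-attention over object queries.

    queries:   (N, D) object query embeddings
    adjacency: (N, N) boolean matrix; True where a scene-graph relation
               (subject-predicate-object) links two queries
    """
    d = queries.size(-1)
    scores = queries @ queries.t() / d ** 0.5                            # (N, N) logits
    mask = adjacency | torch.eye(adjacency.size(0), dtype=torch.bool)    # keep self-edges
    scores = scores.masked_fill(~mask, float('-inf'))                    # sparse: graph edges only
    attn = F.softmax(scores, dim=-1)
    return attn @ queries                                                # relation-aware queries

# toy usage: 4 object queries, relations between 0-1 and 2-3
q = torch.randn(4, 256)
adj = torch.zeros(4, 4, dtype=torch.bool)
adj[0, 1] = adj[1, 0] = adj[2, 3] = adj[3, 2] = True
out = scene_graph_guided_attention(q, adj)
print(out.shape)  # torch.Size([4, 256])
```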

    Less than Few: Self-Shot Video Instance Segmentation

    The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While proven effective, in many practical video settings even labelling a few examples appears unrealistic. This is especially true as the level of detail in spatio-temporal video understanding, and with it the complexity of annotations, continues to increase. Rather than performing few-shot learning with a human oracle providing a few densely labelled support videos, we propose to automatically learn to find appropriate support videos given a query. We call this self-shot learning, and we outline a simple self-supervised learning method to generate an embedding space well suited for unsupervised retrieval of relevant samples. To showcase this novel setting, we tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting, where the goal is to segment instances at the pixel level across the spatial and temporal domains. We provide strong baseline performances that utilize a novel transformer-based model and show that self-shot learning can even surpass few-shot learning and can be combined with it for further performance gains. Experiments on new benchmarks show that our approach achieves strong performance, is competitive with oracle support in some settings, scales to large unlabelled video collections, and can be combined in a semi-supervised setting. Comment: 25 pages, 5 figures, 13 tables.
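
    The self-shot retrieval step described here — selecting support videos for a query from an unlabelled collection via a self-supervised embedding space — reduces to nearest-neighbour search over video embeddings. A minimal sketch, assuming the embeddings are already extracted (the encoder and data below are placeholders, not the paper's model):

```python
import torch
import torch.nn.functional as F

def retrieve_support(query_emb, gallery_embs, k=5):
    """Return indices of the k most similar unlabelled videos to the query.

    query_emb:    (D,) embedding of the query video
    gallery_embs: (M, D) embeddings of the unlabelled video collection
    """
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    sims = g @ q                      # cosine similarities, shape (M,)
    return sims.topk(k).indices       # the self-selected "support set"

# toy usage with random embeddings standing in for a video encoder
gallery = torch.randn(1000, 512)      # unlabelled collection
query = torch.randn(512)
support_idx = retrieve_support(query, gallery, k=5)
print(support_idx)
```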

    Benchmarking of Embedded Object Detection in Optical and RADAR Scenes

    A portable, real-time vital sign estimation prototype is developed using neural-network-based localization, multi-object tracking, and embedded processing optimizations. The system estimates the heart and respiration rates of multiple subjects using direction-of-arrival techniques on RADAR data, and is useful in many civilian and military applications, including search and rescue. The primary contribution of this work is the implementation and benchmarking of neural networks for real-time detection and localization on various systems, including tests of eight neural networks on a discrete GPU and on Jetson Xavier devices. Mean average precision (mAP) and inference speed benchmarks were performed, showing fast and accurate detection and tracking on synthetic and real RADAR data. Another major contribution is the quantification of the relationship between neural network mAP performance and data augmentations. As an example, we focused on image and video compression methods such as JPEG, WebP, H264, and H265. The results show that WebP at a quantization level of 50 and H265 at a constant rate factor of 30 provide the best balance between compression and acceptable mAP. Other minor contributions enhance the functionality of the real-time prototype system, including the implementation and benchmarking of neural network optimizations such as quantization and pruning. Furthermore, appearance-based synthetic RADAR and real RADAR datasets are developed; the latter contains simultaneous optical and RADAR data capture with cross-modal labels. Finally, multi-object tracking methods are benchmarked and a support vector machine is utilized for cross-modal association. In summary, the implementation, benchmarking, and optimization of methods for detection and tracking helped create a real-time vital sign system on a low-profile embedded device. Additionally, this work established a relationship between compression methods and different neural networks for optimal file compression and network performance. Finally, methods for RADAR and optical data collection and cross-modal association are implemented.
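
    The compression study (e.g., WebP at quality 50, H265 at CRF 30) can be reproduced in outline by re-encoding the evaluation data at several settings and re-running the detector's mAP evaluation on each variant. Below is a sketch of the re-encoding side only, assuming Pillow and an ffmpeg binary are available; the evaluation call at the end is a placeholder, not part of the authors' pipeline.

```python
import subprocess
from pathlib import Path
from PIL import Image

def compress_image_webp(src: Path, dst: Path, quality: int = 50) -> None:
    """Re-encode a single frame as WebP at the given quality level."""
    Image.open(src).save(dst, "WEBP", quality=quality)

def compress_video_h265(src: Path, dst: Path, crf: int = 30) -> None:
    """Re-encode a video with libx265 at a constant rate factor."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-c:v", "libx265", "-crf", str(crf), str(dst)],
        check=True)

# sweep a few settings, then evaluate detector mAP on each re-encoded copy
for q in (90, 75, 50, 25):
    compress_image_webp(Path("frame_0001.png"), Path(f"frame_0001_q{q}.webp"), q)
for crf in (23, 30, 37):
    compress_video_h265(Path("clip.mp4"), Path(f"clip_crf{crf}.mp4"), crf)
# evaluate_map(detector, dataset_dir)  # placeholder: run a COCO-style mAP evaluation per variant
```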

    Large-Scale Video Learning Using Narrative Descriptions (์ด์•ผ๊ธฐํ˜• ์„ค๋ช…๋ฌธ์„ ํ™œ์šฉํ•œ ๋Œ€๊ทœ๋ชจ ๋น„๋””์˜ค ํ•™์Šต ์—ฐ๊ตฌ)

    Doctoral dissertation -- Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: Gunhee Kim (๊น€๊ฑดํฌ).

    Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a lot of interest in computer vision research, including image/video captioning, video retrieval, and video question answering. These tasks can be applied to high-level computer vision problems and to various future industries, such as search engines, social marketing, automated driving, and robotics, through question answering and dialogue generation about the surrounding environment. However, despite these developments, video-language learning suffers from a higher degree of complexity. This thesis investigates methodologies for learning the relationship between videos and free-formed language, including explanations, conversations, and question-and-answers, so that the machine can easily adapt to target downstream tasks. First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer in a video attention model, showing that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce an end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using representative visual concept words detected by an attention-based detector. As a follow-up to these methods, we introduce JSFusion (Joint Sequence Fusion), which enables efficient video search and QA through many-to-many matching in the attention model. Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to increase the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which promotes caption generation models to generate coherent narratives for long videos. In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in video, as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applied to the latest models, bringing additional performance improvements. Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in industry by integrating an unlimited amount of video, audio, and free-formed language data from the web.

    Abstract (Korean): Vision-language learning is an important and actively studied area because it can be applied not only to high-level computer vision tasks such as image/video captioning, visual question answering, video retrieval, scene understanding, and event detection, but also, through question answering and dialogue generation about the surrounding environment, to internet search and to many future industries such as social marketing, automated driving, and robotics. Computer vision and natural language processing have each advanced within their own domains, but with the recent advent of deep learning they have developed remarkably and now complement each other, producing a strong synergy. Despite this progress, however, video-language learning often remains difficult because the complexity of the problem is one level higher. This thesis aims to learn the relationship between videos and the corresponding free-formed language (descriptions, dialogues, question answering, and more) more efficiently, and to improve adaptation to target tasks. First, we introduce several methods for efficiently learning the relationship between long sentences and videos, whose visual complexity exceeds that of images: a method that supervises the video-language model with human attention (gaze) labels; a semantic attention method that further reduces the complexity of the attention algorithm by using representative visual words first detected in the video; and a Joint Sequence Fusion method that enables efficient video retrieval and question answering based on many-to-many matching in the attention model. Next, we introduce the Character in Story Identification Network, in which attention goes beyond object-word relations and jointly performs person searching (character grounding) and person re-identification in videos so that the two tasks reinforce each other. Finally, we introduce a method that uses self-supervised learning to guide an attention-based language model to generate coherent descriptions of long videos. In summary, the new methods proposed in this thesis serve as technical stepping stones for video-language problems such as video captioning, video retrieval, and video question answering; the attention modules learned through video captioning, when transplanted into the networks for retrieval, question answering, and person search, achieved state-of-the-art performance on these new problems simultaneously. This shows experimentally that transferring the linguistic knowledge obtained from video-language learning greatly helps multimodal video learning spanning vision and audio. As future work, building on these studies, we aim to create unsupervised learning models that integrate the large-scale language, video, and audio data on the web and can solve many difficult problems in industry.

    Table of contents:
    Chapter 1 Introduction: 1.1 Contributions; 1.2 Outline of the Thesis
    Chapter 2 Related Work: 2.1 Video Captioning; 2.2 Video Retrieval with Natural Language; 2.3 Video Question and Answering; 2.4 Cross-modal Representation Learning for Vision and Language Tasks
    Chapter 3 Human Attention Transfer for Video Captioning: 3.1 Introduction; 3.2 Video Datasets for Caption and Gaze; 3.3 Approach (3.3.1 Video Pre-processing and Description; 3.3.2 The Recurrent Gaze Prediction (RGP) Model; 3.3.3 Construction of Visual Feature Pools; 3.3.4 The Decoder for Caption Generation; 3.3.5 Training); 3.4 Experiments (3.4.1 Evaluation of Gaze Prediction; 3.4.2 Evaluation of Video Captioning; 3.4.3 Human Evaluation via AMT); 3.5 Conclusion
    Chapter 4 Semantic Word Attention for Video QA and Video Captioning: 4.1 Introduction (4.1.1 Related Work; 4.1.2 Contributions); 4.2 Approach (4.2.1 Preprocessing; 4.2.2 An Attention Model for Concept Detection; 4.2.3 Video-to-Language Models; 4.2.4 A Model for Description; 4.2.5 A Model for Fill-in-the-Blank; 4.2.6 A Model for Multiple-Choice Test; 4.2.7 A Model for Retrieval); 4.3 Experiments (4.3.1 The LSMDC Dataset and Tasks; 4.3.2 Quantitative Results; 4.3.3 Qualitative Results); 4.4 Conclusion
    Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data: 5.1 Introduction; 5.2 Related Work; 5.3 Approach (5.3.1 Preprocessing; 5.3.2 The Joint Semantic Tensor; 5.3.3 The Convolutional Hierarchical Decoder; 5.3.4 An Illustrative Example of How the JSFusion Model Works; 5.3.5 Training; 5.3.6 Implementation of Video-Language Models); 5.4 Experiments (5.4.1 LSMDC Dataset and Tasks; 5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks; 5.4.3 Quantitative Results; 5.4.4 Qualitative Results); 5.5 Conclusion
    Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding: 6.1 Introduction; 6.2 Related Work; 6.3 Approach (6.3.1 Video Preprocessing; 6.3.2 Visual Track Embedding; 6.3.3 Textual Character Embedding; 6.3.4 Character Grounding; 6.3.5 Re-Identification; 6.3.6 Joint Training); 6.4 Experiments (6.4.1 Experimental Setup; 6.4.2 Quantitative Results; 6.4.3 Qualitative Results); 6.5 Conclusion
    Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling: 7.1 Introduction; 7.2 Related Work; 7.3 Approach (7.3.1 The Visual Encoder; 7.3.2 The Language Generator; 7.3.3 Adaptation Training; 7.3.4 The Sequential Coherence Loss; 7.3.5 Training with the Adaptation Loss; 7.3.6 Fine-tuning and Inference); 7.4 Experiments (7.4.1 Experimental Setup; 7.4.2 Quantitative Results; 7.4.3 Further Analyses; 7.4.4 Human Evaluation Results; 7.4.5 Qualitative Results); 7.5 Conclusion
    Chapter 8 Conclusion: 8.1 Summary; 8.2 Future Works
    Bibliography; Abstract in Korean; Acknowledgements
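
    One component of the thesis, supervising a video attention model with human gaze, can be illustrated with a simple auxiliary loss that pulls the model's spatial attention map toward a recorded gaze distribution. This is a generic sketch under assumed shapes and a KL formulation; it is not the thesis's exact RGP model.

```python
import torch
import torch.nn.functional as F

def gaze_supervision_loss(attn_logits, gaze_map, eps=1e-8):
    """KL divergence between predicted attention and a human gaze distribution.

    attn_logits: (B, H*W) unnormalized spatial attention scores per frame
    gaze_map:    (B, H*W) non-negative gaze saliency, normalized inside
    """
    attn_log_prob = F.log_softmax(attn_logits, dim=-1)
    gaze_prob = gaze_map / (gaze_map.sum(dim=-1, keepdim=True) + eps)
    return F.kl_div(attn_log_prob, gaze_prob, reduction='batchmean')

# toy usage: a 7x7 attention grid for a batch of 4 frames
logits = torch.randn(4, 49, requires_grad=True)
gaze = torch.rand(4, 49)
loss = gaze_supervision_loss(logits, gaze)
loss.backward()  # would be added to the captioning loss during training
```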

    MEDAVET: Traffic Vehicle Anomaly Detection Mechanism based on spatial and temporal structures in vehicle traffic

    Currently, there are computer vision systems that help us with tasks that would be dull for humans, such as surveillance and vehicle tracking. An important part of this analysis is to identify traffic anomalies: an anomaly tells us that something unusual has happened, in this case on the highway. This paper aims to model vehicle tracking using computer vision to detect traffic anomalies on a highway. We develop the steps of detection, tracking, and analysis of traffic: the detection of vehicles from video of urban traffic, and the tracking of vehicles using a bipartite graph and the Convex Hull algorithm to delimit moving areas. Finally, for anomaly detection we use two data structures to detect the beginning and end of the anomaly: the first is a QuadTree, which groups vehicles that are stopped on the road for a long time, and the second handles vehicles that are occluded. Experimental results show that our method performs acceptably on the Track4 test set, with an F1 score of 85.7% and a mean squared error of 25.432. Comment: 14 pages, 14 figures, submitted to Journal of Internet Services and Applications - JIS
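
    The anomaly-start logic described here, flagging vehicles that stay in roughly the same place for a long time, can be sketched with a simple per-track displacement check. A QuadTree would accelerate the spatial grouping over many tracks; this toy version, with assumed thresholds, only checks each track's recent movement.

```python
from collections import deque

class StoppedVehicleDetector:
    """Flags a track as anomalous if its center barely moves for `window` frames."""

    def __init__(self, window=120, max_disp=5.0):
        self.window = window          # e.g. ~4 s of frames at 30 fps (assumed)
        self.max_disp = max_disp      # pixels of allowed drift (assumed)
        self.history = {}             # track_id -> recent centers

    def update(self, track_id, center):
        h = self.history.setdefault(track_id, deque(maxlen=self.window))
        h.append(center)
        if len(h) < self.window:
            return False
        (x0, y0), (x1, y1) = h[0], h[-1]
        return ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 < self.max_disp

# toy usage: a vehicle that moves, then halts around frame 60
det = StoppedVehicleDetector(window=30, max_disp=2.0)
for frame in range(100):
    center = (min(frame, 60) * 3.0, 100.0)
    if det.update(track_id=7, center=center):
        print("anomaly started around frame", frame)
        break
```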

    Analysis of Using Metric Access Methods for Visual Search of Objects in Video Databases

    This article presents an approach to object retrieval that searches for and localizes all occurrences of an object in a video database, given a query image of the object. Our proposal is based on text-retrieval methods in which video key frames are represented by a dense set of viewpoint-invariant region descriptors that enable recognition to proceed successfully despite changes in camera viewpoint, lighting, and partial occlusion. Vector-quantizing these region descriptors provides a visual analogy of a word: a visual word. These words are grouped into a visual vocabulary, which is used to index all key frames from the video database. Efficient retrieval is then achieved by employing methods from statistical text retrieval, including inverted file systems and text-document frequency weightings. Whereas works in the literature have adopted only a simple sequential scan during search, we investigate the use of different metric access methods (MAMs) — the M-tree, Slim-tree, and D-index — to accelerate the processing of similarity queries. In addition, a ranking strategy based on the spatial layout of the regions (spatial consistency) is fully described and evaluated. Experimental results show that the adoption of MAMs not only improved search performance but also reduced the influence of the vocabulary size on test results, which may improve the scalability of our proposal. Finally, the application of spatial consistency produced a very significant improvement in the results.
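
    The text-retrieval analogy in this abstract — quantizing local descriptors into visual words and scoring key frames with an inverted file and TF-IDF weights — can be sketched as follows. Descriptor extraction and the real vocabulary come from elsewhere; random data and scikit-learn's k-means are stand-ins, and the MAM acceleration and spatial-consistency re-ranking are not shown.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

# 1) Build a visual vocabulary by clustering local region descriptors.
descriptors = np.random.rand(5000, 128).astype(np.float32)  # stand-in for SIFT-like features
vocab = KMeans(n_clusters=200, n_init=4, random_state=0).fit(descriptors)

def to_bow(frame_descriptors):
    """Quantize a key frame's descriptors into a visual-word histogram."""
    words = vocab.predict(frame_descriptors)
    return np.bincount(words, minlength=vocab.n_clusters).astype(np.float32)

# 2) Index key frames with an inverted file and TF-IDF weights.
frames = [np.random.rand(300, 128).astype(np.float32) for _ in range(50)]
bows = np.stack([to_bow(f) for f in frames])
df = (bows > 0).sum(axis=0)                               # document frequency per visual word
idf = np.log(len(frames) / np.maximum(df, 1))
tfidf = (bows / np.maximum(bows.sum(axis=1, keepdims=True), 1)) * idf
inverted = defaultdict(list)                              # visual word -> frames containing it
for fid, bow in enumerate(bows):
    for w in np.nonzero(bow)[0]:
        inverted[w].append(fid)

# 3) Query: score only the frames that share at least one visual word.
def search(query_descriptors, top_k=5):
    q = to_bow(query_descriptors)
    q = (q / max(q.sum(), 1)) * idf
    candidates = {fid for w in np.nonzero(q)[0] for fid in inverted[w]}
    scores = {fid: float(tfidf[fid] @ q) for fid in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(search(np.random.rand(300, 128).astype(np.float32)))
```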

    Probabilistic Face Tracking From Location and Facial Identity Information

    In recent times, advances in object detection have enabled new video processing applications. This paper explores the development of a face tracking system with a probabilistic approach in which face similarities and relative positions are considered. Potential applications and further enhancements include detecting human faces for behavior analysis and video surveillance in chaotic situations such as overlapping objects. The system includes three stages: face detection and face alignment using MTCNN, face recognition using FaceNet, and face tracking based on Gibbs sampling. I evaluate the full-temporal method and three other approaches: the baseline, positional, and instantaneous methods.
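
    The detection and recognition stages map onto off-the-shelf components. Below is a minimal sketch assuming the facenet-pytorch package (which bundles an MTCNN detector and a FaceNet embedding model); the frame path is a placeholder and the probabilistic tracking / Gibbs-sampling stage is not shown.

```python
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1
from PIL import Image

mtcnn = MTCNN(keep_all=True)                               # detect and align all faces
facenet = InceptionResnetV1(pretrained='vggface2').eval()  # 512-d face embeddings

def embed_faces(image_path):
    """Detect, align, and embed every face in one frame."""
    img = Image.open(image_path).convert('RGB')
    faces = mtcnn(img)                 # (num_faces, 3, 160, 160) or None
    if faces is None:
        return None, None
    with torch.no_grad():
        embs = facenet(faces)          # (num_faces, 512)
    boxes, probs = mtcnn.detect(img)   # positions, usable by the tracking stage
    return boxes, embs

boxes, embs = embed_faces('frame_0001.jpg')   # placeholder frame path
if embs is not None:
    # pairwise identity similarity, usable as the facial-identity term in tracking
    sims = torch.nn.functional.cosine_similarity(embs[:, None], embs[None, :], dim=-1)
    print(boxes.shape, sims.shape)
```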