DiffusionVMR: Diffusion Model for Video Moment Retrieval
Video moment retrieval is a fundamental visual-language task that aims to
retrieve target moments from an untrimmed video based on a language query.
Existing methods typically generate numerous proposals manually or via
generative networks in advance as the support set for retrieval, which is not
only inflexible but also time-consuming. Inspired by the success of diffusion
models on object detection, this work aims at reformulating video moment
retrieval as a denoising generation process to get rid of the inflexible and
time-consuming proposal generation. To this end, we propose a novel
proposal-free framework, namely DiffusionVMR, which directly samples random
spans from noise as candidates and introduces denoising learning to ground
target moments. During training, Gaussian noise is added to the real moments,
and the model is trained to learn how to reverse this process. In inference, a
set of time spans is progressively refined from the initial noise to the final
output. Notably, the training and inference of DiffusionVMR are decoupled, and
an arbitrary number of random spans can be used in inference without being
consistent with the training phase. Extensive experiments conducted on three
widely-used benchmarks (i.e., QVHighlight, Charades-STA, and TACoS) demonstrate
the effectiveness of the proposed DiffusionVMR by comparing it with
state-of-the-art methods.
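The denoising inference described above can be sketched in miniature. The update rule below is a coarse, DDIM-like simplification and the denoiser is a toy stand-in that always predicts one fixed moment; the names, step schedule, and update rule are illustrative assumptions, not the paper's implementation:

```python
import random

def diffusion_refine(spans, denoise_fn, steps=8):
    """Progressively refine noisy (center, width) spans into moment predictions.

    spans: (center, width) pairs in [0, 1], initially sampled from random noise.
    denoise_fn: stand-in for the trained model; maps noisy spans to estimates.
    """
    for t in range(steps, 0, -1):
        alpha = t / steps              # crude proxy for the remaining noise level
        predicted = denoise_fn(spans)  # model's estimate of the clean spans
        # move each span partway toward the prediction (coarse denoising step)
        spans = [
            (c + (1 - alpha) * (pc - c), w + (1 - alpha) * (pw - w))
            for (c, w), (pc, pw) in zip(spans, predicted)
        ]
    return spans

# toy "denoiser" that always predicts one ground-truth moment (hypothetical)
toy_denoiser = lambda noisy: [(0.6, 0.2) for _ in noisy]

random.seed(0)
# training and inference are decoupled: any number of random spans may be used
init = [(random.random(), random.random()) for _ in range(10)]
refined = diffusion_refine(init, toy_denoiser)
```

Note how the number of initial spans is chosen freely at inference time, mirroring the decoupling the abstract emphasizes.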
Open-Vocabulary Object Detection via Scene Graph Discovery
In recent years, open-vocabulary (OV) object detection has attracted
increasing research attention. Unlike traditional detection, which only
recognizes fixed-category objects, OV detection aims to detect objects in an
open category set. Previous works often leverage vision-language (VL) training
data (e.g., referring grounding data) to recognize OV objects. However, they
only use pairs of nouns and individual objects in VL data, while these data
usually contain much more information, such as scene graphs, which are also
crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based
Discovery Network (SGDN) that exploits scene graph cues for OV detection.
Firstly, a scene-graph-based decoder (SGDecoder) including sparse
scene-graph-guided attention (SSGA) is presented. It captures scene graphs and
leverages them to discover OV objects. Secondly, we propose scene-graph-based
prediction (SGPred), where we build a scene-graph-based offset regression
(SGOR) mechanism to enable mutual enhancement between scene graph extraction
and object localization. Thirdly, we design a cross-modal learning mechanism in
SGPred. It takes scene graphs as bridges to improve the consistency between
cross-modal embeddings for OV object classification. Experiments on COCO and
LVIS demonstrate the effectiveness of our approach. Moreover, we show the
ability of our model for OV scene graph detection, while previous OV scene
graph generation methods cannot tackle this task.
Less than Few: Self-Shot Video Instance Segmentation
The goal of this paper is to bypass the need for labelled examples in
few-shot video understanding at run time. While proven effective, in many
practical video settings even labelling a few examples appears unrealistic.
This is especially true as the level of detail in spatio-temporal video
understanding, and with it the complexity of annotations, continues to increase.
Rather than performing few-shot learning with a human oracle to provide a few
densely labelled support videos, we propose to automatically learn to find
appropriate support videos given a query. We call this self-shot learning and
we outline a simple self-supervised learning method to generate an embedding
space well-suited for unsupervised retrieval of relevant samples. To showcase
this novel setting, we tackle, for the first time, video instance segmentation
in a self-shot (and few-shot) setting, where the goal is to segment instances
at the pixel-level across the spatial and temporal domains. We provide strong
baseline performances that utilize a novel transformer-based model and show
that self-shot learning can even surpass few-shot and can be positively
combined for further performance gains. Experiments on new benchmarks show that
our approach achieves strong performance, is competitive to oracle support in
some settings, scales to large unlabelled video collections, and can be
combined in a semi-supervised setting.
Comment: 25 pages, 5 figures, 13 tables
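The retrieval step at the heart of self-shot learning can be illustrated with a minimal sketch: embed the query and an unlabelled pool, then take the nearest neighbours as the support set. The embeddings below are hand-made toy vectors and cosine similarity stands in for the learned metric; both are assumptions for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def self_shot_support(query_emb, pool, k=2):
    """Rank an unlabelled pool by similarity to the query embedding and
    return the top-k as automatically selected support videos."""
    ranked = sorted(pool.items(), key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# toy pool of video embeddings (hypothetical names and values)
pool = {
    "vid_a": [0.9, 0.1, 0.0],
    "vid_b": [0.1, 0.9, 0.1],
    "vid_c": [0.8, 0.2, 0.1],
}
support = self_shot_support([1.0, 0.0, 0.0], pool, k=2)  # -> vid_a, vid_c
```

The point of the self-supervised embedding space is precisely that such unsupervised nearest-neighbour retrieval returns relevant support samples without a human oracle.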
Benchmarking of Embedded Object Detection in Optical and RADAR Scenes
A portable, real-time vital sign estimation prototype is developed using neural-network-based localization, multi-object tracking, and embedded processing optimizations. The system estimates heart and respiration rates of multiple subjects using direction-of-arrival techniques on RADAR data. This system is useful in many civilian and military applications, including search and rescue.
The primary contribution of this work is the implementation and benchmarking of neural networks for real-time detection and localization on various systems, including tests of eight neural networks on a discrete GPU and on Jetson Xavier devices. Mean average precision (mAP) and inference speed benchmarks were performed. We have shown fast and accurate detection and tracking using synthetic and real RADAR data.
Another major contribution is the quantification of the relationship between neural network mAP performance and data augmentations. As an example, we focused on image and video compression methods, such as JPEG, WebP, H264, and H265. The results show that WebP at a quantization level of 50 and H265 at a constant rate factor of 30 provide the best balance between compression and acceptable mAP.
Other minor contributions are achieved in enhancing the functionality of the real-time prototype system. This includes the implementation and benchmarking of neural network optimizations, such as quantization and pruning. Furthermore, appearance-based synthetic RADAR and real RADAR datasets are developed. The latter contains simultaneous optical and RADAR data capture and cross-modal labels. Finally, multi-object tracking methods are benchmarked and a support vector machine is utilized for cross-modal association.
In summary, the implementation, benchmarking, and optimization of methods for detection and tracking helped create a real-time vital sign system on a low-profile embedded device. Additionally, this work established a relationship between compression methods and different neural networks for optimal file compression and network performance. Finally, methods for RADAR and optical data collection and cross-modal association are implemented.
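Since mAP is the headline metric throughout, it may help to see how per-class average precision is typically computed. The sketch below is the generic, all-point-interpolated textbook version (mAP is then the mean of AP over classes), not the exact evaluation code used in this work:

```python
def average_precision(scored_detections, num_gt):
    """All-point-interpolated AP for one class.

    scored_detections: (confidence, is_true_positive) pairs for every detection.
    num_gt: number of ground-truth objects of this class.
    """
    dets = sorted(scored_detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) as the confidence threshold is swept
    for _, is_tp in dets:
        tp += is_tp
        fp += not is_tp
        points.append((tp / num_gt, tp / (tp + fp)))
    # integrate precision over recall, interpolating with the max precision
    # at any recall level to the right of the current one
    ap, prev_recall = 0.0, 0.0
    for i, (r, _) in enumerate(points):
        p_interp = max(p for _, p in points[i:])
        ap += (r - prev_recall) * p_interp
        prev_recall = r
    return ap

# two detections, both correct, covering both ground-truth boxes -> AP = 1.0
assert average_precision([(0.9, True), (0.8, True)], num_gt=2) == 1.0
```

Whether a detection counts as a true positive is decided beforehand by an IoU threshold against the ground-truth boxes, which is where compression artifacts from JPEG/WebP/H264/H265 ultimately show up in the metric.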
A Study on Large-Scale Video Learning Using Narrative Descriptions
Thesis (Ph.D.) -- Seoul National University Graduate School: Dept. of Computer Science and Engineering, College of Engineering, 2021. 2. Gunhee Kim.
Extensive contributions are being made to develop intelligent agents that can recognize and communicate with the world. In this sense, various video-language tasks have drawn a lot of interest in computer vision research, including image/video captioning, video retrieval, and video question answering.
These tasks can be applied to high-level computer vision problems and to various future industries, such as search engines, social marketing, automated driving, and robotics, for example through QA / dialog generation about the surrounding environment.
However, despite these developments, video-language learning suffers from a high degree of problem complexity.
This thesis investigates methodologies for learning the relationship between videos and free-formed languages, including explanations, conversations, and question-and-answers, so that the machine can easily adapt to target downstream tasks.
First, we introduce several methods to learn the relationship between long sentences and videos efficiently. We introduce approaches for supervising human attention transfer in the video attention model, showing that the video attention mechanism can benefit from explicit human gaze labels. Next, we introduce an end-to-end semantic attention method, which further reduces the visual attention algorithm's complexity by using representative visual concept words detected by an attention-based detector. As a follow-up to these methods, we introduce JSFusion (Joint Sequence Fusion), which enables efficient video search and QA through many-to-many matching in the attention model.
Next, we introduce CiSIN (Character in Story Identification Network), which uses attention to improve the performance of character grounding and character re-identification in movies. Finally, we introduce Transitional Adaptation, which promotes caption generation models to generate coherent narratives for long videos.
In summary, this thesis presents novel approaches for automatic video description generation and retrieval, and shows the benefits of extracting linguistic knowledge about objects and motion in video, as well as the advantage of multimodal audio-visual learning for understanding videos. Since the proposed methods are easily adapted to any video-language task, they are expected to be applicable to the latest models, bringing additional performance improvements.
Moving forward, we plan to design an unsupervised video learning framework that can solve many challenges in the industry by integrating an unlimited amount of video, audio, and free-formed language data from the web.
Chapter 1
Introduction
1.1 Contributions
1.2 Outline of the thesis
Chapter 2
Related Work
2.1 Video Captioning
2.2 Video Retrieval with Natural Language
2.3 Video Question and Answering
2.4 Cross-modal Representation Learning for Vision and Language Tasks
Chapter 3 Human Attention Transfer for Video Captioning
3.1 Introduction
3.2 Video Datasets for Caption and Gaze
3.3 Approach
3.3.1 Video Pre-processing and Description
3.3.2 The Recurrent Gaze Prediction (RGP) Model
3.3.3 Construction of Visual Feature Pools
3.3.4 The Decoder for Caption Generation
3.3.5 Training
3.4 Experiments
3.4.1 Evaluation of Gaze Prediction
3.4.2 Evaluation of Video Captioning
3.4.3 Human Evaluation via AMT
3.5 Conclusion
Chapter 4 Semantic Word Attention for Video QA and Video Captioning
4.1 Introduction
4.1.1 Related Work
4.1.2 Contributions
4.2 Approach
4.2.1 Preprocessing
4.2.2 An Attention Model for Concept Detection
4.2.3 Video-to-Language Models
4.2.4 A Model for Description
4.2.5 A Model for Fill-in-the-Blank
4.2.6 A Model for Multiple-Choice Test
4.2.7 A Model for Retrieval
4.3 Experiments
4.3.1 The LSMDC Dataset and Tasks
4.3.2 Quantitative Results
4.3.3 Qualitative Results
4.4 Conclusion
Chapter 5 Joint Sequence Fusion Attention for Multimodal Sequence Data
5.1 Introduction
5.2 Related Work
5.3 Approach
5.3.1 Preprocessing
5.3.2 The Joint Semantic Tensor
5.3.3 The Convolutional Hierarchical Decoder
5.3.4 An Illustrative Example of How the JSFusion Model Works
5.3.5 Training
5.3.6 Implementation of Video-Language Models
5.4 Experiments
5.4.1 LSMDC Dataset and Tasks
5.4.2 MSR-VTT-(RET/MC) Dataset and Tasks
5.4.3 Quantitative Results
5.4.4 Qualitative Results
5.5 Conclusion
Chapter 6 Character Re-Identification and Character Grounding for Movie Understanding
6.1 Introduction
6.2 Related Work
6.3 Approach
6.3.1 Video Preprocessing
6.3.2 Visual Track Embedding
6.3.3 Textual Character Embedding
6.3.4 Character Grounding
6.3.5 Re-Identification
6.3.6 Joint Training
6.4 Experiments
6.4.1 Experimental Setup
6.4.2 Quantitative Results
6.4.3 Qualitative Results
6.5 Conclusion
Chapter 7 Transitional Adaptation of Pretrained Models for Visual Storytelling
7.1 Introduction
7.2 Related Work
7.3 Approach
7.3.1 The Visual Encoder
7.3.2 The Language Generator
7.3.3 Adaptation Training
7.3.4 The Sequential Coherence Loss
7.3.5 Training with the Adaptation Loss
7.3.6 Fine-tuning and Inference
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Quantitative Results
7.4.3 Further Analyses
7.4.4 Human Evaluation Results
7.4.5 Qualitative Results
7.5 Conclusion
Chapter 8 Conclusion
8.1 Summary
8.2 Future Works
Bibliography
Abstract (in Korean)
Acknowledgements
MEDAVET: Traffic Vehicle Anomaly Detection Mechanism based on spatial and temporal structures in vehicle traffic
Currently, there are computer vision systems that help us with tasks that
would be dull for humans, such as surveillance and vehicle tracking. An
important part of this analysis is to identify traffic anomalies. An anomaly
tells us that something unusual has happened, in this case on the highway. This
paper aims to model vehicle tracking using computer vision to detect traffic
anomalies on a highway. We develop the steps of detection, tracking, and
analysis of traffic: the detection of vehicles from video of urban traffic, the
tracking of vehicles using a bipartite graph and the Convex Hull algorithm to
delimit moving areas. Finally, for anomaly detection we use two data structures to detect the beginning and end of the anomaly: the first is a QuadTree, which groups vehicles that remain stopped on the road for a long time, and the second handles vehicles that become occluded. Experimental results show that our method is acceptable on the Track4 test set, with an F1 score of 85.7% and a mean squared error of 25.432.
Comment: 14 pages, 14 figures, submitted to the Journal of Internet Services and Applications (JISA)
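The Convex Hull step used to delimit moving areas can be illustrated with Andrew's monotone chain algorithm over the positions of a tracked vehicle. The track below is invented for illustration; this is the textbook algorithm, not the paper's implementation:

```python
def convex_hull(points):
    """Andrew's monotone chain: smallest convex polygon enclosing 2-D points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:                       # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # concatenate, dropping the duplicated endpoints
    return lower[:-1] + upper[:-1]

# positions of one tracked vehicle over time; the interior point is discarded,
# leaving the polygon that delimits the area the vehicle moved through
track = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]
hull = convex_hull(track)
```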
Analysis of Using Metric Access Methods for Visual Search of Objects in Video Databases
This article presents an approach to object retrieval that searches for and localizes all the occurrences of an object in a video database, given a query image of the object. Our proposal is based on text-retrieval methods in which video key frames are represented by a dense set of viewpoint invariant region descriptors that enable recognition to proceed successfully despite changes in camera viewpoint, lighting, and partial occlusions. Vector quantizing these region descriptors provides a visual analogy of a word - a visual word. Those words are grouped into a visual vocabulary which is used to index all key frames from the video database. Efficient retrieval is then achieved by employing methods from statistical text retrieval, including inverted file systems, and text-document frequency weightings. Though works in the literature have only adopted a simple sequential scan during search, we investigate the use of different metric access methods (MAM): M-tree, Slim-tree, and D-index, in order to accelerate the processing of similarity queries. In addition, a ranking strategy based on the spatial layout of the regions (spatial consistency) is fully described and evaluated. Experimental results have shown that the adoption of MAMs not only has improved the search performance but also has reduced the influence of the vocabulary size over test results, which may improve the scalability of our proposal. Finally, the application of spatial consistency has produced a very significant improvement of the results
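The pipeline described above — vector-quantized visual words, an inverted file, and frequency weighting — can be sketched as follows. Visual words are shown as strings for readability; in a real system they are cluster indices produced by quantizing region descriptors, and the frame names are hypothetical:

```python
import math
from collections import Counter, defaultdict

class InvertedIndex:
    """Index key frames as bags of visual words; rank with tf-idf weighting."""

    def __init__(self, frames):
        # frames: {frame_id: [visual_word, ...]} after vector quantization
        self.tf = {fid: Counter(words) for fid, words in frames.items()}
        self.postings = defaultdict(set)  # visual word -> frames containing it
        for fid, words in frames.items():
            for w in words:
                self.postings[w].add(fid)
        n = len(frames)
        self.idf = {w: math.log(n / len(fids))
                    for w, fids in self.postings.items()}

    def search(self, query_words):
        # only frames sharing at least one visual word are ever scored,
        # which is what makes the inverted file efficient
        candidates = set().union(
            *(self.postings.get(w, set()) for w in query_words))
        scores = {
            fid: sum(self.tf[fid][w] * self.idf.get(w, 0.0)
                     for w in query_words)
            for fid in candidates
        }
        return sorted(scores, key=scores.get, reverse=True)

index = InvertedIndex({
    "frame1": ["car", "wheel", "sky"],
    "frame2": ["sky", "tree"],
    "frame3": ["car", "wheel", "wheel"],
})
ranking = index.search(["car", "wheel"])  # frame3 ranks first (more "wheel" hits)
```

A metric access method such as an M-tree, Slim-tree, or D-index would replace the sequential scan over `candidates` with an index structure over the descriptor space; the scoring itself is unchanged.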
Probabilistic Face Tracking From Location and Facial Identity Information
In recent times, advances in object detection have enabled new video processing applications. This paper explores the development of a face tracking system with a probabilistic approach in which face similarities and relative positions are considered. Potential applications and further enhancements include detecting human faces along with behavior analysis and video surveillance in chaotic situations such as overlapping objects. The system includes three stages: face detection and face alignment using MTCNN, face recognition using FaceNet, and face tracking after Gibbs sampling. I evaluate the full-temporal method and three other approaches: baseline, positional, and instantaneous methods.
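A greedy stand-in for the association step can show how identity similarity and relative position combine into one cost. The paper's Gibbs-sampling formulation is replaced here by this much simpler scheme; the weight `alpha`, the 100-pixel distance normalization, and all embeddings and names are assumptions for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two face embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def match_faces(tracks, detections, alpha=0.5):
    """Greedily associate new face detections to existing tracks, mixing
    identity dissimilarity with normalized positional distance."""
    def cost(t, d):
        id_cost = 1 - cosine(t["emb"], d["emb"])
        pos_cost = math.dist(t["pos"], d["pos"]) / 100.0  # assumed frame scale
        return alpha * id_cost + (1 - alpha) * pos_cost

    pairs = sorted(
        (cost(t, d), t["id"], d["id"]) for t in tracks for d in detections)
    used_t, used_d, assignment = set(), set(), {}
    for c, tid, did in pairs:            # take cheapest pairs first
        if tid not in used_t and did not in used_d:
            assignment[did] = tid
            used_t.add(tid)
            used_d.add(did)
    return assignment

tracks = [
    {"id": "alice", "emb": [1.0, 0.0], "pos": (10, 10)},
    {"id": "bob",   "emb": [0.0, 1.0], "pos": (80, 80)},
]
detections = [
    {"id": "d0", "emb": [0.9, 0.1], "pos": (12, 11)},
    {"id": "d1", "emb": [0.1, 0.9], "pos": (78, 82)},
]
assignment = match_faces(tracks, detections)  # d0 -> alice, d1 -> bob
```

Using both terms is what lets tracking survive overlapping faces: when positions are ambiguous, the identity embedding disambiguates, and vice versa.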