13,052 research outputs found
Spatio-temporal Video Re-localization by Warp LSTM
The need to efficiently find the video content a user wants is growing with
the explosion of user-generated videos on the Web. Existing keyword-based or
content-based video retrieval methods usually determine what occurs in a
video, but not when and where. In this paper, we answer the question of when
and where by formulating a new task, namely spatio-temporal video
re-localization. Specifically, given a query video and a
reference video, spatio-temporal video re-localization aims to localize
tubelets in the reference video such that the tubelets semantically correspond
to the query. To accurately localize the desired tubelets in the reference
video, we propose a novel warp LSTM network, which propagates the
spatio-temporal information for a long period and thereby captures the
corresponding long-term dependencies. Another issue for spatio-temporal video
re-localization is the lack of properly labeled video datasets. Therefore, we
reorganize the videos in the AVA dataset to form a new dataset for
spatio-temporal video re-localization research. Extensive experimental results
show that the proposed model achieves superior performance over the designed
baselines on the spatio-temporal video re-localization task.
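The abstract does not spell out the warp LSTM's update rule. As a rough illustration of the general idea, a convolutional LSTM whose recurrent states are warped by an externally supplied flow field before each gate update might look like this (PyTorch sketch; `WarpLSTMCell`, `warp`, and the flow input are illustrative assumptions, not the paper's actual architecture):

```python
import torch
import torch.nn.functional as F

def warp(state, flow):
    """Warp a state tensor (B, C, H, W) with a dense flow field (B, 2, H, W)."""
    b, _, h, w = state.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = grid + flow
    # Normalize to [-1, 1] as grid_sample expects.
    cx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    cy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((cx, cy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(state, sample_grid, align_corners=True)

class WarpLSTMCell(torch.nn.Module):
    """Convolutional LSTM cell whose recurrent states are flow-warped first
    so that spatio-temporal information follows the motion across frames."""
    def __init__(self, channels):
        super().__init__()
        self.gates = torch.nn.Conv2d(2 * channels, 4 * channels, 3, padding=1)

    def forward(self, x, h, c, flow):
        h, c = warp(h, flow), warp(c, flow)  # align states with the motion
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```

Running the cell over a frame sequence, carrying `(h, c)` forward, is what lets the hidden state accumulate long-term spatio-temporal dependencies along the motion trajectories.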
Distributed video coding for wireless video sensor networks: a review of the state-of-the-art architectures
Distributed video coding (DVC) is a relatively new video coding architecture originated from two fundamental theorems, namely Slepian–Wolf and Wyner–Ziv. Recent research developments have made DVC attractive for applications in the emerging domain of wireless video sensor networks (WVSNs). This paper reviews the state-of-the-art DVC architectures with a focus on understanding their opportunities and gaps in addressing the operational requirements and application needs of WVSNs.
Color-decoupled photo response non-uniformity for digital image forensics
The last few years have seen the use of photo response non-uniformity noise (PRNU), a unique fingerprint of imaging sensors, in various digital forensic applications such as source device identification, content integrity verification and authentication. However, the use of a colour filter array for capturing only one of the three colour components per pixel introduces colour interpolation noise, while the existing methods for extracting PRNU provide no effective means for addressing this issue. Because the artificial colours obtained through the colour interpolation process are not directly acquired from the scene by physical hardware, we expect that the PRNU extracted from the physical components, which are free from interpolation noise, should be more reliable than that from the artificial channels, which carry interpolation noise. Based on this assumption we propose a Colour-Decoupled PRNU (CD-PRNU) extraction method, which first decomposes each colour channel into 4 sub-images and then extracts the PRNU noise from each sub-image. The PRNU noise patterns of the sub-images are then assembled to get the CD-PRNU. This new method can prevent the interpolation noise from propagating into the physical components, thus improving the accuracy of device identification and image content integrity verification.
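For intuition, the sub-image decomposition the abstract describes can be sketched as follows (a rough NumPy sketch; the 3x3 mean filter is a crude stand-in for the wavelet denoiser typically used in the PRNU literature, and all function names are illustrative):

```python
import numpy as np

def box3(img):
    """3x3 mean filter with edge padding (crude denoiser stand-in; PRNU work
    typically uses a wavelet-based denoiser here)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def noise_residual(img):
    """PRNU-style residual: the image minus its denoised estimate."""
    return img - box3(img)

def cd_prnu_channel(channel):
    """Colour-decoupled residual for one channel: split into the four 2x2 CFA
    sub-images, extract the residual from each separately so interpolation
    noise cannot leak across CFA positions, then reassemble."""
    out = np.zeros(channel.shape, dtype=float)
    for dy in (0, 1):
        for dx in (0, 1):
            sub = channel[dy::2, dx::2].astype(float)
            out[dy::2, dx::2] = noise_residual(sub)
    return out
```

The key point is that each of the four sub-images contains only pixels from the same CFA position, so the denoising and residual extraction never mix physically sensed values with interpolated ones.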
A Study on Test-Time Adaptive Methodologies for Video Frame Interpolation
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Kyoung Mu Lee.
Computationally handling videos has been one of the foremost goals in computer vision. In particular, analyzing the complex dynamics including motion and occlusion between two frames is of fundamental importance in understanding the visual contents of a video. Research on video frame interpolation, a problem where the goal is to synthesize high-quality intermediate frames between the two input frames, specifically investigates the low-level characteristics within the consecutive frames of a video. The topic has recently been gaining popularity and can be applied to various real-world applications such as generating slow-motion effects, novel view synthesis, or video stabilization. Existing methods for video frame interpolation aim to design complex new architectures to effectively estimate and compensate for the motion between two input frames. However, natural videos contain a wide variety of different scenarios, including foreground/background appearance and motion, frame rate, and occlusion. Therefore, even with a huge amount of training data, it is difficult for a single model to generalize well to all possible situations.
This dissertation introduces novel methodologies for test-time adaptation for tackling the problem of video frame interpolation. In particular, I propose to make three different aspects of the deep-learning-based framework adaptive: (1) feature activations, (2) network weights, and (3) architectural structures. Specifically, I first present how adaptively scaling the feature activations of a deep neural network with respect to each input frame using attention models allows for accurate interpolation. Unlike previous approaches that heavily depend on optical flow estimation models, the proposed channel-attention-based model can achieve high-quality frame synthesis without explicit motion estimation. Then, meta-learning is employed for fast adaptation of the parameter values of the frame interpolation models. By learning to adapt to each input video clip, the proposed framework can consistently improve the performance of many existing models with just a single gradient update to their parameters. Lastly, I introduce an input-adaptive dynamic architecture that can assign different inference paths to each local region of the input frames. By deciding the scaling factors of the inputs and the network depth of the early exit in the interpolation model, the dynamic framework can greatly improve computational efficiency while maintaining, and sometimes even surpassing, the performance of the baseline interpolation method.
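The meta-learning component (a single gradient update per input video) can be illustrated with a minimal MAML-style sketch, assuming a triplet of consecutive frames whose middle frame serves as a self-supervised target; the toy model and all names are illustrative, not the dissertation's actual networks:

```python
import copy
import torch

class TinyInterp(torch.nn.Module):
    """Toy stand-in for a frame interpolation network (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(6, 3, 3, padding=1)

    def forward(self, f0, f2):
        # Predict the middle frame from the two surrounding frames.
        return self.net(torch.cat([f0, f2], dim=1))

def adapt_one_step(model, f0, f1, f2, lr=1e-4):
    """One test-time gradient update: the middle frame f1 of an input triplet
    is a free self-supervised target for interpolating between f0 and f2."""
    adapted = copy.deepcopy(model)  # leave the meta-trained weights intact
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss = torch.nn.functional.l1_loss(adapted(f0, f2), f1)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return adapted
```

Meta-training would optimize the initial weights so that this single inner-loop step yields the largest improvement on each clip, which is what allows one update to suffice at test time.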
The effectiveness of the proposed test-time adaptation methodologies is extensively evaluated with multiple benchmark datasets for video frame interpolation. Thorough ablation studies with various hyperparameter settings and baseline networks also demonstrate the superiority of adaptation to the test-time inputs, which is a new research direction orthogonal to the other state-of-the-art frame interpolation approaches.
1 Introduction
1.1 Motivations
1.2 Proposed method
1.3 Contributions
1.4 Organization of dissertation
2 Feature Adaptation based Approach
2.1 Introduction
2.2 Related works
2.2.1 Video frame interpolation
2.2.2 Attention mechanism
2.3 Proposed Method
2.3.1 Overview of network architecture
2.3.2 Main components
2.3.3 Loss
2.4 Understanding our model
2.4.1 Internal feature visualization
2.4.2 Intermediate image reconstruction
2.5 Experiments
2.5.1 Datasets
2.5.2 Implementation details
2.5.3 Comparison to the state-of-the-art
2.5.4 Ablation study
2.6 Summary
3 Meta-Learning based Approach
3.1 Introduction
3.2 Related works
3.3 Proposed method
3.3.1 Video frame interpolation problem set-up
3.3.2 Exploiting extra information at test time
3.3.3 Background on MAML
3.3.4 MetaVFI: Meta-learning for frame interpolation
3.4 Experiments
3.4.1 Settings
3.4.2 Meta-learning algorithm selection
3.4.3 Video frame interpolation results
3.4.4 Ablation studies
3.5 Summary
4 Dynamic Architecture based Approach
4.1 Introduction
4.2 Related works
4.2.1 Video frame interpolation
4.2.2 Adaptive inference
4.3 Proposed Method
4.3.1 Dynamic framework overview
4.3.2 Scale and depth finder (SD-finder)
4.3.3 Dynamic interpolation model
4.3.4 Training
4.4 Experiments
4.4.1 Datasets
4.4.2 Implementation details
4.4.3 Quantitative comparison
4.4.4 Visual comparison
4.4.5 Ablation study
4.5 Summary
5 Conclusion
5.1 Summary of dissertation
5.2 Future works
Bibliography
Abstract in Korean
Driving steady-state visual evoked potentials at arbitrary frequencies using temporal interpolation of stimulus presentation
Date of Acceptance: 29/10/2015. We thank Renate Zahn for help with data collection. This work was supported by Deutsche Forschungsgemeinschaft (AN 841/1-1, MU 972/20-1). We would like to thank A. Trujillo-Ortiz, R. Hernandez-Walls, A. Castro-Perez and K. Barba-Rojo (Universidad Autonoma de Baja California) for making Matlab code for non-sphericity corrections freely available. Peer reviewed.
H-VFI: Hierarchical Frame Interpolation for Videos with Large Motions
Capitalizing on the rapid development of neural networks, recent video frame
interpolation (VFI) methods have achieved notable improvements. However, they
still fall short for real-world videos containing large motions. Complex
deformation and/or occlusion caused by large motions make it an extremely
difficult problem in video frame interpolation. In this paper, we propose a
simple yet effective solution, H-VFI, to deal with large motions in video frame
interpolation. H-VFI contributes a hierarchical video interpolation transformer
(HVIT) to learn a deformable kernel with a coarse-to-fine strategy at multiple
scales. The learnt deformable kernel is then used to convolve the input
frames and predict the interpolated frame. Starting from the smallest scale,
H-VFI updates the deformable kernel by a residual in succession based on former
predicted kernels, intermediate interpolated results, and hierarchical features
from the transformer. Biases and masks to refine the final outputs are then predicted
by a transformer block based on interpolated results. The advantage of such a
progressive approximation is that the large-motion frame interpolation problem
can be decomposed into several relatively simpler sub-tasks, which enables
highly accurate final predictions. Another noteworthy contribution
of our paper consists of a large-scale high-quality dataset, YouTube200K, which
contains videos depicting a great variety of scenarios captured at high
resolution and high frame rate. Extensive experiments on multiple frame
interpolation benchmarks validate that H-VFI outperforms existing
state-of-the-art methods, especially for videos with large motions.
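The coarse-to-fine residual refinement the abstract describes follows a common pattern that can be sketched as below (a minimal sketch; `predict_residual` stands in for H-VFI's transformer stages, and the kernel tensor layout is an assumption):

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_refine(predict_residual, init_kernel, num_scales=3):
    """Refine a kernel estimate from coarse to fine: at each scale, upsample
    the current estimate and add a predicted residual correction.
    `predict_residual` is a placeholder for the per-scale prediction stage,
    which in H-VFI would also condition on intermediate interpolation
    results and hierarchical transformer features."""
    kernel = init_kernel  # (B, C, h, w) at the coarsest scale
    for _ in range(num_scales):
        kernel = F.interpolate(kernel, scale_factor=2, mode="bilinear",
                               align_corners=False)
        kernel = kernel + predict_residual(kernel)
    return kernel
```

Each stage only has to correct the previous estimate rather than solve the full large-motion problem, which is what makes the decomposition into simpler sub-tasks effective.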