End-to-End Learning of Video Super-Resolution with Motion Compensation
Learning approaches have shown great success in the task of super-resolving
an image given a low-resolution input. Video super-resolution additionally aims
to exploit the information from multiple images. Typically, the
images are related via optical flow and subsequent image warping. In this
paper, we provide an end-to-end video super-resolution network that, in
contrast to previous works, includes the estimation of optical flow in the
overall network architecture. We analyze the usage of optical flow for video
super-resolution and find that common off-the-shelf image warping does not
allow video super-resolution to benefit much from optical flow. Instead, we
propose an operation for motion compensation that performs warping from low to
high resolution directly. We show that with this network configuration, video
super-resolution can benefit from optical flow and we obtain state-of-the-art
results on the popular test sets. We also show that the processing of whole
images rather than independent patches is responsible for a large increase in
accuracy. Comment: Accepted to GCPR201
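For readers unfamiliar with the "off-the-shelf image warping" this abstract refers to, the following is a minimal sketch of conventional backward warping with bilinear sampling. It is not the low-to-high-resolution motion compensation the paper proposes. The function names, the list-of-rows image layout, the `(dx, dy)` flow convention, and the border clamping are all assumptions made for illustration.

```python
import math


def bilinear_sample(img, x, y):
    """Sample a grayscale image (list of rows) at real-valued (x, y).

    Coordinates outside the image are clamped to the border (an assumption).
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom


def backward_warp(img, flow):
    """Warp img so that output pixel (x, y) is read from (x + dx, y + dy).

    flow[y][x] holds the (dx, dy) displacement for output pixel (x, y).
    """
    h, w = len(img), len(img[0])
    return [
        [bilinear_sample(img, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


if __name__ == "__main__":
    image = [[0, 10, 20, 30], [0, 10, 20, 30]]
    half_pixel_right = [[(0.5, 0.0)] * 4 for _ in range(2)]
    print(backward_warp(image, half_pixel_right))
```

With a uniform half-pixel displacement, each output pixel is the average of two horizontal neighbors, except at the right border, where the clamped sample returns the border value.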
Multiple Vector-Based MEMC and Deep CNN for Video Frame Interpolation
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2019. 2. Hyuk-Jae Lee.
Block-based hierarchical motion estimation is widely used and succeeds in generating high-quality interpolation. However, it still fails to estimate the motion of small objects when the background region moves in a different direction. This is because the motion of small objects is neglected by the down-sampling and over-smoothing operations at the top level of the image pyramid in the maximum a posteriori (MAP) method. Consequently, the motion vector of small objects cannot be detected at the bottom level, and the small objects often appear deformed in the interpolated frame. This thesis proposes a novel algorithm that preserves the motion vector of small objects by adding a secondary motion vector candidate that represents their movement. This additional candidate is always propagated from the top to the bottom layer of the image pyramid. Experimental results demonstrate that the intermediate frame interpolated by the proposed algorithm has significantly better visual quality than conventional MAP-based frame interpolation.
In motion-compensated frame interpolation, a repetition pattern in an image makes it difficult to derive an accurate motion vector because multiple similar local minima exist in the search space of the matching cost for motion estimation. To improve the accuracy of motion estimation in a repetition region, this thesis takes a semi-global approach that exploits both the local and global characteristics of the region. A histogram of the motion vector candidates is built using a voter-based voting system, which is more reliable than an elector-based voting system. Experimental results demonstrate that the proposed method significantly outperforms the previous local approach in terms of both objective peak signal-to-noise ratio (PSNR) and subjective visual quality.
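The objective metric named above, PSNR, follows the standard definition 10 * log10(peak^2 / MSE). The sketch below computes it for two flat sequences of pixel values. The function name and the default 8-bit peak of 255 are assumptions for illustration, and the thesis's exact evaluation protocol is not given in this abstract.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


if __name__ == "__main__":
    print(psnr([100, 100], [110, 90]))
```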
In video frame interpolation, or motion-compensated frame rate up-conversion (MC-FRUC), motion compensation along unidirectional motion trajectories directly causes overlap and hole issues. To solve these issues, this research presents a new algorithm for bidirectional motion-compensated frame interpolation. First, the proposed method generates bidirectional motion vectors from the two unidirectional motion vector fields (forward and backward) obtained from unidirectional motion estimation, by projecting the forward and backward motion vectors into the interpolated frame. A comprehensive metric that extends the distance between a projected block and an interpolated block is proposed to compute weighting coefficients when an interpolated block has multiple projected blocks. Holes are filled with a vector median filter over the available non-hole neighboring blocks. The proposed method outperforms existing MC-FRUC methods and significantly removes block artifacts.
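The hole-filling step above relies on a vector median filter. A minimal sketch of the textbook vector median is shown here: it selects, from the available neighbor motion vectors, the one with the smallest summed Euclidean distance to the others. The function names, the use of `None` to mark hole blocks, and the choice of Euclidean distance are assumptions for illustration. The thesis may define the neighborhood and distance differently.

```python
import math


def vector_median(vectors):
    """Return the input vector with the smallest summed Euclidean distance to all inputs."""
    if not vectors:
        raise ValueError("need at least one motion vector")

    def total_distance(v):
        return sum(math.hypot(v[0] - u[0], v[1] - u[1]) for u in vectors)

    return min(vectors, key=total_distance)


def fill_hole(neighbor_mvs):
    """Pick a motion vector for a hole block from its neighbors.

    neighbor_mvs lists the neighbors' (dx, dy) vectors, with None for
    neighbors that are themselves holes. Returns None if no neighbor is usable.
    """
    available = [mv for mv in neighbor_mvs if mv is not None]
    return vector_median(available) if available else None


if __name__ == "__main__":
    neighbors = [(2, 1), (2, 1), (3, 1), None, (20, -15)]
    print(fill_hole(neighbors))
```

Unlike a component-wise mean, the vector median always returns one of the input vectors, so a single outlier such as `(20, -15)` does not drag the result away from the dominant motion.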
Video frame interpolation with a deep convolutional neural network (CNN) is also investigated in this thesis. Optical flow estimation and video frame interpolation are regarded as a chicken-and-egg problem, in which each problem affects the other. This thesis presents a stack of networks trained to estimate intermediate optical flows from a first synthesized intermediate frame; the final interpolated frame is then generated by a second synthesis network whose input stacks that first frame with the two frames warped using the learned intermediate optical flows. The primary benefit is that it joins the two problems into one comprehensive framework that is learned jointly, using both analysis-by-synthesis for optical flow estimation and, conversely, CNN-kernel-based synthesis-by-analysis. The proposed network is the first attempt to bridge the two branches of previous approaches, optical-flow-based synthesis and CNN-kernel-based synthesis, in a single network. Experiments on various challenging datasets all show that the proposed network outperforms state-of-the-art methods by significant margins for video frame interpolation and that the estimated optical flows are accurate for challenging movements. The proposed deep video frame interpolation network is also applied as post-processing to improve the coding efficiency of the state-of-the-art video compression standard, HEVC/H.265, and experimental results demonstrate the efficiency of the proposed network.
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
1.1. Hierarchical Motion Estimation of Small Objects
1.2. Motion Estimation of a Repetition Pattern Region
1.3. Motion-Compensated Frame Interpolation
1.4. Video Frame Interpolation with Deep CNN
1.5. Outline of the Thesis
Chapter 2. Previous Works
2.1. Previous Works on Hierarchical Block-Based Motion Estimation
2.1.1. Maximum a Posteriori (MAP) Framework
2.1.2. Hierarchical Motion Estimation
2.2. Previous Works on Motion Estimation for a Repetition Pattern Region
2.3. Previous Works on Motion Compensation
2.4. Previous Works on Video Frame Interpolation with Deep CNN
Chapter 3. Hierarchical Motion Estimation for Small Objects
3.1. Problem Statement
3.2. The Alternative Motion Vector of High Cost Pixels
3.3. Modified Hierarchical Motion Estimation
3.4. Framework of the Proposed Algorithm
3.5. Experimental Results
3.5.1. Performance Analysis
3.5.2. Performance Evaluation
Chapter 4. Semi-Global Accurate Motion Estimation for a Repetition Pattern Region
4.1. Problem Statement
4.2. Objective Function and Constraints
4.3. Elector based Voting System
4.4. Voter based Voting System
4.5. Experimental Results
Chapter 5. Multiple Motion Vectors based Motion Compensation
5.1. Problem Statement
5.2. Adaptive Weighted Multiple Motion Vectors based Motion Compensation
5.2.1. One-to-Multiple Motion Vector Projection
5.2.2. A Comprehensive Metric as the Extension of Distance
5.3. Handling Hole Blocks
5.4. Framework of the Proposed Motion Compensated Frame Interpolation
5.5. Experimental Results
Chapter 6. Video Frame Interpolation with a Stack of Deep CNN
6.1. Problem Statement
6.2. The Proposed Network for Video Frame Interpolation
6.2.1. A Stack of Synthesis Networks
6.2.2. Intermediate Optical Flow Derivation Module
6.2.3. Warping Operations
6.2.4. Training and Loss Function
6.2.5. Network Architecture
6.2.6. Experimental Results
6.2.6.1. Frame Interpolation Evaluation
6.2.6.2. Ablation Experiments
6.3. Extension for Quality Enhancement for Compressed Videos Task
6.4. Extension for Improving the Coding Efficiency of HEVC based Low Bitrate Encoder
Chapter 7. Conclusion
References
Context-aware Synthesis for Video Frame Interpolation
Video frame interpolation algorithms typically estimate optical flow or its
variations and then use it to guide the synthesis of an intermediate frame
between two consecutive original frames. To handle challenges like occlusion,
bidirectional flow between the two input frames is often estimated and used to
warp and blend the input frames. However, how to effectively blend the two
warped frames still remains a challenging problem. This paper presents a
context-aware synthesis approach that warps not only the input frames but also
their pixel-wise contextual information and uses them to interpolate a
high-quality intermediate frame. Specifically, we first use a pre-trained
neural network to extract per-pixel contextual information for input frames. We
then employ a state-of-the-art optical flow algorithm to estimate bidirectional
flow between them and pre-warp both input frames and their context maps.
Finally, unlike common approaches that blend the pre-warped frames, our method
feeds them and their context maps to a video frame synthesis neural network to
produce the interpolated frame in a context-aware fashion. Our neural network
is fully convolutional and is trained end to end. Our experiments show that our
method can handle challenging scenarios such as occlusion and large motion and
outperforms representative state-of-the-art approaches. Comment: CVPR 2018, http://graphics.cs.pdx.edu/project/ctxsy
Multi-Frame Quality Enhancement for Compressed Video
The past few years have witnessed great success in applying deep learning to
enhance the quality of compressed image/video. The existing approaches mainly
focus on enhancing the quality of a single frame, ignoring the similarity
between consecutive frames. In this paper, we observe that heavy quality
fluctuation exists across compressed video frames, and thus low-quality frames
can be enhanced using the neighboring high-quality frames, which we refer to as
Multi-Frame Quality Enhancement (MFQE). Accordingly, this paper proposes an MFQE approach
for compressed video, as a first attempt in this direction. In our approach, we
first develop a Support Vector Machine (SVM) based detector to locate Peak
Quality Frames (PQFs) in compressed video. Then, a novel Multi-Frame
Convolutional Neural Network (MF-CNN) is designed to enhance the quality of
compressed video, taking a non-PQF and its two nearest PQFs as the
input. The MF-CNN compensates motion between the non-PQF and PQFs through the
Motion Compensation subnet (MC-subnet). Subsequently, the Quality Enhancement
subnet (QE-subnet) reduces compression artifacts of the non-PQF with the help
of its nearest PQFs. Finally, the experiments validate the effectiveness and
generality of our MFQE approach in advancing the state-of-the-art quality
enhancement of compressed video. The code of our MFQE approach is available at
https://github.com/ryangBUAA/MFQE.git Comment: to appear in CVPR 201
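To make the pairing of a non-PQF with its two nearest PQFs concrete, here is a small sketch that works from per-frame quality scores. It assumes that a PQF is a frame whose quality is strictly higher than both of its neighbors and that such scores (for example, PSNR against the raw video) are known. Neither assumption comes from this abstract, which instead describes an SVM-based detector. All names are hypothetical.

```python
def find_pqfs(quality):
    """Indices of frames whose quality is strictly higher than both neighbors."""
    return [
        i for i in range(1, len(quality) - 1)
        if quality[i] > quality[i - 1] and quality[i] > quality[i + 1]
    ]


def nearest_pqfs(index, pqfs):
    """Return (previous PQF, next PQF) around a frame index.

    pqfs must be sorted in ascending order. A side with no PQF yields None.
    """
    before = [p for p in pqfs if p < index]
    after = [p for p in pqfs if p > index]
    return (before[-1] if before else None, after[0] if after else None)


if __name__ == "__main__":
    scores = [30.1, 32.5, 30.4, 30.0, 30.2, 32.8, 30.3]
    peaks = find_pqfs(scores)
    print(peaks, nearest_pqfs(3, peaks))
```

Frames near the start or end of a sequence may have a PQF on only one side. How the paper handles that case is not stated here, so the sketch simply returns `None` for the missing side.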
Temporal Interpolation via Motion Field Prediction
Navigated 2D multi-slice dynamic Magnetic Resonance (MR) imaging enables high
contrast 4D MR imaging during free breathing and provides in-vivo observations
for treatment planning and guidance. Navigator slices are vital for
retrospective stacking of 2D data slices in this method. However, they also
prolong the acquisition sessions. Temporal interpolation of navigator slices can
be used to reduce the number of navigator acquisitions without degrading
specificity in stacking. In this work, we propose a convolutional neural
network (CNN) based method for temporal interpolation via motion field
prediction. The proposed formulation incorporates the prior knowledge that a
motion field underlies changes in the image intensities over time. Previous
approaches that interpolate directly in the intensity space are prone to
produce blurry images or even remove structures in the images. Our method
avoids such problems and faithfully preserves the information in the image.
Further, an important advantage of our formulation is that it provides an
unsupervised estimation of bi-directional motion fields. We show that these
motion fields can be used to halve the number of registrations required during
4D reconstruction, thus substantially reducing the reconstruction time. Comment: Submitted to 1st Conference on Medical Imaging with Deep Learning
(MIDL 2018), Amsterdam, The Netherland