7,130 research outputs found

    Two-Stream Convolutional Networks for Action Recognition in Videos

    Full text link
    We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video action benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds previous attempts to use deep nets for video classification by a large margin.
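    As a hedged illustration of the two-stream idea, the PyTorch sketch below builds a spatial stream over a single RGB frame and a temporal stream over a stack of L = 10 dense optical-flow fields (2L input channels), fused late by averaging class scores. The small VGG-style backbone and layer sizes are assumptions for brevity, not the paper's exact CNN-M-2048-style configuration.

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Sketch of a two-stream action recognition network: a spatial
    stream on one RGB frame and a temporal stream on stacked optical
    flow, fused by averaging class scores (late fusion)."""
    def __init__(self, num_classes=101, flow_len=10):
        super().__init__()
        self.spatial = self._stream(3, num_classes)
        self.temporal = self._stream(2 * flow_len, num_classes)

    @staticmethod
    def _stream(in_ch, num_classes):
        # Small illustrative backbone; sizes are placeholders.
        return nn.Sequential(
            nn.Conv2d(in_ch, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, num_classes),
        )

    def forward(self, rgb, flow):
        # rgb: (B, 3, H, W) still frame; flow: (B, 2*L, H, W) stacked flow.
        return (self.spatial(rgb) + self.temporal(flow)) / 2
```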

    Detection of dirt impairments from archived film sequences: survey and evaluations

    Get PDF
    Film dirt is the most commonly encountered artifact in archive restoration applications. Since dirt usually appears as a temporally impulsive event, motion-compensated interframe processing is widely applied for its detection. However, motion-compensated prediction entails a high degree of complexity and can be unreliable when motion estimation fails. Consequently, many techniques using spatial or spatiotemporal filtering without motion compensation have also been proposed as alternatives. A comprehensive survey and evaluation of existing methods is presented, in which both qualitative and quantitative performance are compared in terms of accuracy, robustness, and complexity. After analyzing these algorithms and identifying their limitations, we conclude with guidance on choosing among these algorithms and with promising directions for future research.
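    To make the shared "temporally impulsive" assumption concrete, here is a hedged NumPy sketch of a simple spike detector in the spirit of the spatiotemporal heuristics surveyed: a pixel is flagged as candidate dirt when it deviates from both its temporal neighbours by more than a threshold and with the same sign. This is a simplified stand-in rather than any single surveyed algorithm; a motion-compensated variant would warp the neighbouring frames first.

```python
import numpy as np

def spike_dirt_mask(prev, curr, next_, threshold=30.0):
    """Flag temporally impulsive pixels as candidate film dirt.

    A dirt blotch typically exists in a single frame only, so the
    current pixel should deviate from BOTH temporal neighbours in the
    same direction. Frames are greyscale float arrays of equal shape.
    """
    d_prev = curr - prev
    d_next = curr - next_
    same_sign = np.sign(d_prev) == np.sign(d_next)
    impulsive = (np.abs(d_prev) > threshold) & (np.abs(d_next) > threshold)
    return same_sign & impulsive  # boolean dirt mask
```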

    Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement

    Full text link
    In this paper, we propose a Hierarchical Learned Video Compression (HLVC) method with three hierarchical quality layers and a recurrent enhancement network. The frames in the first layer are compressed by an image compression method with the highest quality. Using these frames as references, we propose the Bi-Directional Deep Compression (BDDC) network to compress the second layer with relatively high quality. Then, the third layer frames are compressed with the lowest quality, by the proposed Single Motion Deep Compression (SMDC) network, which adopts a single motion map to estimate the motions of multiple frames, thus saving bits for motion information. In our deep decoder, we develop the Weighted Recurrent Quality Enhancement (WRQE) network, which takes both compressed frames and the bit stream as inputs. In the recurrent cell of WRQE, the memory and update signal are weighted by quality features to reasonably leverage multi-frame information for enhancement. In our HLVC approach, the hierarchical quality benefits the coding efficiency, since the high quality information facilitates the compression and enhancement of low quality frames at encoder and decoder sides, respectively. Finally, the experiments validate that our HLVC approach advances the state-of-the-art of deep video compression methods, and outperforms the "Low-Delay P (LDP) very fast" mode of x265 in terms of both PSNR and MS-SSIM. The project page is at https://github.com/RenYang-home/HLVC. Comment: Published in CVPR 2020; corrected a minor typo in the footnote of Table 1; corrected Figure 1.
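    The three-layer structure can be made concrete with a small sketch. Assuming, purely for illustration, a group of pictures of 10 frames whose boundary frames form layer 1, whose middle frame forms layer 2, and whose remaining frames form layer 3 (the paper's exact grouping may differ), a layer-assignment helper might look like this:

```python
def assign_quality_layers(gop_size=10):
    """Illustrative hierarchical-quality assignment for one GOP:
    layer 1 = boundary frames (image codec, highest quality),
    layer 2 = middle frame (bi-directional compression from layer-1 refs),
    layer 3 = the rest (lowest quality; motions estimated from a single
    shared motion map to save bits). The grouping is an assumption."""
    layers = {}
    for t in range(gop_size + 1):
        if t in (0, gop_size):
            layers[t] = 1
        elif t == gop_size // 2:
            layers[t] = 2
        else:
            layers[t] = 3
    return layers

# assign_quality_layers(10) ->
# {0: 1, 1: 3, 2: 3, 3: 3, 4: 3, 5: 2, 6: 3, 7: 3, 8: 3, 9: 3, 10: 1}
```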

    Multi-Vector-Based MEMC and Deep CNN for Video Frame Interpolation

    Get PDF
    Doctoral dissertation (Ph.D.), Seoul National University, Department of Electrical and Computer Engineering, February 2019. Advisor: 이혁재.

    Block-based hierarchical motion estimation is widely used and is successful in generating high-quality interpolation. However, it still fails to estimate the motion of small objects when the background region moves in a different direction. This is because the motion of small objects is neglected by the down-sampling and over-smoothing operations at the top level of the image pyramid in the maximum a posteriori (MAP) method. Consequently, the motion vectors of small objects cannot be detected at the bottom level, and the small objects often appear deformed in the interpolated frame. This thesis proposes a novel algorithm that preserves the motion vectors of small objects by adding a secondary motion vector candidate that represents their movement. This additional candidate is always propagated from the top to the bottom level of the image pyramid. Experimental results demonstrate that intermediate frames interpolated by the proposed algorithm have significantly better visual quality than those of conventional MAP-based frame interpolation.

    In motion-compensated frame interpolation, a repetitive pattern in an image makes it difficult to derive an accurate motion vector, because multiple similar local minima exist in the search space of the matching cost. To improve the accuracy of motion estimation in repetition regions, this thesis takes a semi-global approach that exploits both local and global characteristics of a repetition region: a histogram of motion vector candidates is built using a voter-based voting system, which is more reliable than an elector-based voting system. Experimental results demonstrate that the proposed method significantly outperforms the previous local approach in terms of both objective peak signal-to-noise ratio (PSNR) and subjective visual quality.

    In video frame interpolation, or motion-compensated frame rate up-conversion (MC-FRUC), motion compensation along unidirectional motion trajectories directly causes overlap and hole problems. To solve these issues, this thesis presents a new algorithm for bidirectional motion-compensated frame interpolation. The proposed method first generates bidirectional motion vectors from the two unidirectional motion vector fields (forward and backward) obtained by unidirectional motion estimation, by projecting the forward and backward vectors onto the interpolated frame. For the case in which an interpolated block receives multiple projected blocks, a comprehensive metric that extends the distance between a projected block and the interpolated block is proposed to compute the weighting coefficients. Holes are filled by a vector median filter over the available non-hole neighbouring blocks (a sketch of these two steps appears after this entry). The proposed method outperforms existing MC-FRUC methods and significantly reduces block artifacts.

    Video frame interpolation with a deep convolutional neural network (CNN) is also investigated in this thesis. Optical flow estimation and video frame interpolation are treated as a chicken-and-egg problem in which each affects the other. This thesis presents a stack of networks: a first synthesis network produces an initial intermediate frame, intermediate optical flows are then estimated from it, and a second synthesis network generates the final interpolated frame from the initial frame and the two warped frames produced by the learned intermediate flows. The primary benefit is that the two problems are glued into one comprehensive framework trained as a whole, combining analysis-by-synthesis for optical flow estimation with, conversely, CNN-kernel-based synthesis-by-analysis. The proposed network is the first attempt to bridge the two previous branches of approaches, optical-flow-based synthesis and CNN-kernel-based synthesis, in a single comprehensive network. Experiments on various challenging datasets show that the proposed network outperforms state-of-the-art methods by significant margins for video frame interpolation, and that the estimated optical flows are accurate for challenging motion. Finally, the proposed deep video frame interpolation network is applied as a post-processing step to improve the coding efficiency of the state-of-the-art video compression standard HEVC/H.265, and experimental results prove its efficiency.

    Contents: Chapter 1, Introduction; Chapter 2, Previous Works; Chapter 3, Hierarchical Motion Estimation for Small Objects; Chapter 4, Semi-Global Accurate Motion Estimation for a Repetition Pattern Region; Chapter 5, Multiple Motion Vectors based Motion Compensation; Chapter 6, Video Frame Interpolation with a Stack of Deep CNN; Chapter 7, Conclusion.
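    Below is a hedged NumPy sketch of the bidirectional projection and vector-median hole filling referenced above. The block size, rounding rule, and flat candidate list are simplifying assumptions; the thesis's comprehensive distance-based weighting metric is not reproduced here.

```python
import numpy as np

def project_to_midframe(mv_fwd, block=8):
    """Project a forward block motion field (frame t -> t+1) onto the
    frame to be interpolated at t+0.5. Each source block lands where
    half of its motion vector carries it; multiple hits per target
    block are kept as candidates, and unhit blocks remain holes.
    mv_fwd: (rows, cols, 2) array of (dy, dx) in pixels."""
    rows, cols, _ = mv_fwd.shape
    hits = {}
    for r in range(rows):
        for c in range(cols):
            dy, dx = mv_fwd[r, c]
            rr = int(round(r + dy / (2 * block)))  # half vector, block units
            cc = int(round(c + dx / (2 * block)))
            if 0 <= rr < rows and 0 <= cc < cols:
                hits.setdefault((rr, cc), []).append(mv_fwd[r, c])
    return hits

def vector_median(candidates):
    """Vector median filter: choose the candidate minimising the summed
    Euclidean distance to all other candidates (used here to fill a
    hole block from its available non-hole neighbours)."""
    vs = np.asarray(candidates, dtype=float)
    cost = np.linalg.norm(vs[:, None, :] - vs[None, :, :], axis=2).sum(axis=1)
    return vs[np.argmin(cost)]
```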

    Video Compression using Neural Weight Step and Huffman Coding Techniques

    Get PDF
    Background: This paper proposes a Hierarchical Video Compression Scheme (HVCS) with three hierarchical quality layers and a Recurrent Quality Enhancement (RQEN) network. Image compression techniques are used to compress the frames in the first layer, which have the highest quality. Using a high-quality frame as reference, a Bi-Directional Deep Compression (BDC) network is proposed for frame compression in the second layer with considerable quality. In the third layer, frames are compressed at low quality using the adopted Single Motion Compression (SMC) network, which proposes a single motion map for motion estimation across multiple frames; as a result, SMC provides the motion information using fewer bits. At the decoding stage, a weighted Recurrent Quality Enhancement (RQEN) network is developed that takes both the bit stream and the compressed frames as inputs. In the RQEN cell, the update signal and memory are weighted by quality features so that multi-frame information positively influences enhancement. HVCS adopts hierarchical quality to benefit coding efficiency, since the high-quality information improves frame compression at the encoder and enhances the low-quality frames at the decoder. Experimental results validate that the proposed HVCS approach surpasses state-of-the-art compression methods.

    Results: Tables 1 and 2 present the rate-distortion values on both video datasets. As mentioned, PSNR and MS-SSIM are used for quality evaluation, and bit-rates are computed in bits per pixel (bpp). Table 1 shows better PSNR performance for the proposed compression model than for other methods such as Chao et al. [7] or the optimized methods [1], and it also outperforms H.265 on the standard JCT-VC dataset. The proposed compression scheme likewise yields better bit-rate performance than H.265 on UVG. As shown in Table 2, the MS-SSIM evaluation gives the proposed scheme better performance than all other learned approaches, surpassing both H.264 and H.265. In terms of bit-rate performance on UVG, Lee et al. [11] is comparable, and Guo et al. [10] performs below H.265; on JCT-VC, DVC [10] is only comparable with H.265. By contrast, the rate-distortion performance of HVCS is clearly better than that of H.265. Furthermore, the Bjøntegaard delta bit-rate (BDBR) [47] is computed with H.265 as the anchor: BDBR measures the average bit-rate difference relative to the anchor, with lower values indicating better performance [48] (a sketch of this computation follows this entry). Table 3 reports BDBR in terms of both PSNR and MS-SSIM, where negative numbers indicate a bit-rate reduction relative to the anchor. These results surpass H.265, with bold numbers marking the best results achieved by learned methods; Table 3 thus provides a fair comparison with the (MS-SSIM- and PSNR-)optimized DVC techniques [10] against the H.265 anchor.

    Conclusion: This work proposes a learned video compression scheme utilizing hierarchical frame quality with recurrent enhancement. Specifically, frames are divided into hierarchical levels 1, 2 and 3 of decreasing quality. Image compression methods are proposed for the first layer, while BDC and SMC are proposed for layers 2 and 3, respectively. The RQEN network takes the compressed frames together with frame-quality and bit-rate information as inputs for multi-frame enhancement. Experimental results validated the efficiency of the proposed HVCS compression scheme. As in other compression techniques, the frame structure is set manually in this scheme; a promising direction for future work is to develop DNNs that automatically learn the prediction and the hierarchy.
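    Since the comparison hinges on the Bjøntegaard delta bit-rate against the H.265 anchor, a brief sketch of the standard BDBR computation may help: it fits a third-order polynomial of log bit-rate as a function of quality for each codec and integrates the gap over the overlapping quality range. The function below is a generic sketch, not the paper's code, and expects at least four rate-distortion points per codec.

```python
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test):
    """Bjøntegaard delta bit-rate (BDBR): average bit-rate change of the
    test codec relative to the anchor at equal quality (PSNR or MS-SSIM).
    Negative return values mean bit-rate savings. Needs >= 4 RD points."""
    log_ra, log_rt = np.log10(rate_anchor), np.log10(rate_test)
    poly_a = np.polyfit(q_anchor, log_ra, 3)   # log-rate as cubic in quality
    poly_t = np.polyfit(q_test, log_rt, 3)
    lo = max(min(q_anchor), min(q_test))       # overlapping quality interval
    hi = min(max(q_anchor), max(q_test))
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100      # percent vs. the anchor
```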

    Combining Residual Networks with LSTMs for Lipreading

    Full text link
    We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary, consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing. Comment: Submitted to Interspeech 2017
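    As a hedged sketch of the described combination, the PyTorch model below uses a 3D-convolutional front-end over the grey mouth-region clip, a per-frame 2D ResNet, and a bidirectional LSTM with temporal averaging before a 500-way word classifier. The depths, widths, and ResNet-34 trunk are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class LipreadingNet(nn.Module):
    """Sketch: spatiotemporal conv front-end + per-frame ResNet features
    + bidirectional LSTM + word-level classifier."""
    def __init__(self, num_classes=500):
        super().__init__()
        self.front3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        trunk = resnet34(weights=None)
        # Accept the 64-channel front-end output instead of RGB images.
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
        trunk.fc = nn.Identity()                    # keep 512-d features
        self.resnet = trunk
        self.lstm = nn.LSTM(512, 256, num_layers=2, bidirectional=True,
                            batch_first=True)
        self.head = nn.Linear(2 * 256, num_classes)

    def forward(self, x):                # x: (B, 1, T, H, W) mouth crops
        f = self.front3d(x)              # (B, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        f = self.resnet(f).reshape(b, t, -1)        # per-frame 512-d features
        out, _ = self.lstm(f)                       # (B, T, 512)
        return self.head(out.mean(dim=1))           # temporal pooling + logits
```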