
    Novel Motion Anchoring Strategies for Wavelet-based Highly Scalable Video Compression

    This thesis investigates new motion anchoring strategies targeted at wavelet-based highly scalable video compression (WSVC). We depart from two practices that are deeply ingrained in existing video compression systems. Instead of the commonly used block motion, which has poor scalability attributes, we employ piecewise-smooth motion together with a highly scalable motion boundary description. The combination of this more "physical" motion description with motion discontinuity information allows us to change the conventional strategy of anchoring motion at target frames to anchoring motion at reference frames, which improves motion inference across time. In the proposed reference-based motion anchoring strategies, motion fields are mapped from reference to target frames, where they serve as prediction references; during this mapping process, disoccluded regions are readily discovered. Observing that motion discontinuities displace with foreground objects, we propose motion-discontinuity-driven motion mapping operations that handle traditionally challenging regions around moving objects. The reference-based motion anchoring exposes an intricate connection between temporal frame interpolation (TFI) and video compression: when employed in a compression system, all anchoring strategies explored in this thesis perform TFI once all residual information is quantized to zero at a given temporal level. The interpolation performance is evaluated on both natural and synthetic sequences, where we show favourable comparisons with state-of-the-art TFI schemes. We explore three reference-based motion anchoring strategies. In the first, the motion anchoring is "flipped" with respect to a hierarchical B-frame structure. We develop an analytical model to determine the weights of the different spatio-temporal subbands, and assess the suitability and benefits of this reference-based WSVC for (highly scalable) video compression. Reduced motion coding cost and improved frame prediction, especially around moving objects, result in improved rate-distortion performance compared to a target-based WSVC. As the thesis evolves, the motion anchoring is progressively simplified to one where all motion is anchored at a single base frame; this central motion organization facilitates the incorporation of higher-order motion models, which improve prediction performance in regions undergoing motion with non-constant velocity.
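
    To make the reference-based anchoring concrete, the following is a minimal NumPy sketch of mapping a dense motion field from a reference frame onto the target frame grid, with disoccluded target pixels falling out as the locations that no reference pixel maps onto. It assumes a simple per-pixel field and nearest-integer splatting; the thesis' piecewise-smooth motion and boundary-driven handling are not modelled here.

```python
import numpy as np

def map_motion_to_target(motion_ref):
    """Forward-map a motion field anchored at the reference frame onto the
    target frame grid. Target pixels that no reference pixel lands on are
    flagged as disoccluded. motion_ref: (H, W, 2) array of (dy, dx)
    vectors pointing from reference to target."""
    H, W, _ = motion_ref.shape
    motion_tgt = np.zeros_like(motion_ref)
    covered = np.zeros((H, W), dtype=bool)
    ys, xs = np.mgrid[0:H, 0:W]
    # Landing position of each reference pixel in the target frame.
    ty = np.rint(ys + motion_ref[..., 0]).astype(int)
    tx = np.rint(xs + motion_ref[..., 1]).astype(int)
    ok = (ty >= 0) & (ty < H) & (tx >= 0) & (tx < W)
    # Anchor the negated vector at the target position it maps to, so the
    # target pixel can point back to its prediction reference. Conflicts
    # between foreground and background are resolved arbitrarily in this
    # toy version; the thesis uses motion-discontinuity information.
    motion_tgt[ty[ok], tx[ok]] = -motion_ref[ys[ok], xs[ok]]
    covered[ty[ok], tx[ok]] = True
    return motion_tgt, ~covered   # ~covered marks disoccluded pixels
```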

    SpVOS: Efficient Video Object Segmentation with Triple Sparse Convolution

    Semi-supervised video object segmentation (Semi-VOS), which requires annotating only the first frame of a video to segment future frames, has received increased attention recently. Among existing pipelines, the memory-matching-based one is becoming the main research stream, as it can fully utilize temporal sequence information to obtain high-quality segmentation results. Even though this type of method has achieved promising performance, the overall framework still suffers from heavy computation overhead, mainly caused by the per-frame dense convolution operations between high-resolution feature maps and each kernel filter. Therefore, we propose a sparse baseline for VOS named SpVOS, which develops a novel triple sparse convolution to reduce the computation costs of the overall VOS framework. The designed triple gate, taking full consideration of both spatial and temporal redundancy between adjacent video frames, adaptively makes a three-way decision on how to apply the sparse convolution at each pixel to control the computation overhead of each layer, while maintaining sufficient discrimination capability to distinguish similar objects and avoid error accumulation. A mixed sparse training strategy, coupled with an objective that incorporates a sparsity constraint, is also developed to balance VOS segmentation performance and computation costs. Experiments are conducted on two mainstream VOS datasets, DAVIS and YouTube-VOS. Results show that the proposed SpVOS outperforms other state-of-the-art sparse methods and even maintains performance comparable to a typical non-sparse VOS baseline, e.g., an 83.04% (79.29%) overall score on the DAVIS-2017 (YouTube-VOS) validation set versus the baseline's 82.88% (80.36%), while saving up to 42% of FLOPs, showing its potential for resource-constrained scenarios.
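
    The triple-gate idea can be sketched as a per-pixel three-way routing of the convolution. The PyTorch snippet below is an illustrative stand-in, not the paper's implementation: the thresholds t_hi and t_lo are hypothetical (the paper learns its gating end to end), and a dense convolution stands in where a true sparse kernel would only evaluate the recomputed locations.

```python
import torch
import torch.nn as nn

class TripleGateConv(nn.Module):
    """Illustrative triple-gated convolution: per spatial location, either
    (a) recompute the convolution, (b) reuse the previous frame's output
    feature, or (c) suppress the response entirely."""
    def __init__(self, cin, cout, t_hi=0.2, t_lo=0.05):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, kernel_size=3, padding=1)
        self.t_hi, self.t_lo = t_hi, t_lo

    def forward(self, x, x_prev, y_prev):
        # Per-pixel magnitude of temporal change between adjacent frames.
        diff = (x - x_prev).abs().mean(dim=1, keepdim=True)
        recompute = (diff > self.t_hi).float()                      # strong change
        reuse = ((diff <= self.t_hi) & (diff > self.t_lo)).float()  # mild change
        # Dense conv used for brevity; a real sparse implementation would
        # only evaluate the 'recompute' locations.
        y_new = self.conv(x)
        return recompute * y_new + reuse * y_prev                   # else: zero

# Usage with random tensors; y_prev is the cached output for frame t-1.
layer = TripleGateConv(16, 32)
x, x_prev = torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64)
y = layer(x, x_prev, torch.zeros(1, 32, 64, 64))
```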

    A Study on Frame Prediction Method based on Operation Probability Map

    λ™μ˜μƒλ‚΄μ—μ„œ 손상에 μ˜ν•΄ μ†Œμ‹€λœ ν”„λ ˆμž„μ„ λ³΅μ›ν•˜κ±°λ‚˜ 연속적인 μƒˆλ‘œμš΄ ν”„λ ˆμž„μ„ μƒμ„±ν•˜λŠ” 기법인 ν”„λ ˆμž„ μ˜ˆμΈ‘μ€ κ°μ²΄λ“€μ˜ λ™μž‘ 예츑이 ν•„μš”ν•œ μžμœ¨μ£Όν–‰, λ³΄μ•ˆ λ“±μ˜ 미래 μ£Όμš” κΈ°μˆ λ‘œμ„œ μ£Όλͺ©λ°›κ³  μžˆλ‹€. 졜근 이 κΈ°μˆ μ€ λ”₯λŸ¬λ‹ 기술과 κ²°ν•©ν•˜μ—¬ 예츑 정확도가 많이 ν–₯μƒλ˜κ³  μžˆμœΌλ‚˜ λ§Žμ€ ν•™μŠ΅λ°μ΄ν„°μ™€ μ—°μ‚°λŸ‰μ΄ 수반되기 λ•Œλ¬Έμ— μ‹€μ§ˆμ μΈ μ μš©μ—λŠ” 어렀움이 μ‘΄μž¬ν•œλ‹€. 기쑴의 λ”₯λŸ¬λ‹ 기반 예츑 λͺ¨λΈμ€ μƒˆλ‘œμš΄ ν”„λ ˆμž„ 생성 κ³Όμ •μ—μ„œ μ˜ˆμΈ‘μ— μ˜ν•΄ μƒμ„±λœ ν”„λ ˆμž„μ„ ν”Όλ“œλ°±ν•˜κΈ° λ•Œλ¬Έμ— λˆ„μ μ˜€μ°¨κ°€ 많이 λ°œμƒν•˜μ—¬ μ‹œκ°„μ΄ 지남에 따라 예츑 정확도가 κ°μ†Œν•œλ‹€. λ”°λΌμ„œ λ³Έ λ…Όλ¬Έμ—μ„œλŠ” convolution neural network (CNN)와 long short-term memory (LSTM)으둜 κ΅¬μ„±λœ λ„€νŠΈμ›Œν¬λ₯Ό 톡해 ν”„λ ˆμž„λ“€μ˜ λ™μž‘ νŠΉμ§•λ“€μ„ μΆ”μΆœν•˜κ³  νŒ¨ν„΄μ„ ν•™μŠ΅ν•˜μ—¬ λ™μž‘ ν™•λ₯  지도λ₯Ό μƒμ„±ν•˜μ—¬ μ›€μ§μž„μ΄ λ°œμƒν•œ μ˜μ—­μ— λŒ€ν•˜μ—¬ deconvolution neural network(DNN)λ₯Ό 톡해 이후 ν”„λ ˆμž„μ„ μƒμ„±ν•˜λŠ” μƒˆλ‘œμš΄ ν”„λ ˆμž„ 예츑 λͺ¨λΈμ„ μ œμ•ˆν•œλ‹€. μ œμ•ˆν•œ λͺ¨λΈμ€ CNNκ³Ό LSTM을 톡해 ν”„λ ˆμž„λ“€μ˜ λ™μž‘ νŠΉμ§•λ“€μ„ μΆ”μΆœν•˜κ³  νŒ¨ν„΄μ„ ν•™μŠ΅ν•˜μ—¬ λ™μž‘ ν™•λ₯  지도λ₯Ό μƒμ„±ν•œλ‹€. 이λ₯Ό 톡해 μž„μ˜μ˜ ν•œ ν”„λ ˆμž„μ—μ„œ λ™μž‘μ΄ λ°œμƒν•˜λŠ” μ˜μ—­λ₯Ό νŒλ³„ν•˜κ³  이 μ˜μ—­λ§Œ DNN을 톡해 μƒˆλ‘œμš΄ ν”„λ ˆμž„μ„ νšλ“ν•œλ‹€. μ΄λ•Œ ν•™μŠ΅ λ‚œμ΄λ„κ°€ 높은 DNN의 효율적인 ν•™μŠ΅μ„ μœ„ν•΄ generative adversarial nets(GAN) 기법을 μ μš©ν•œλ‹€. μ œμ•ˆλœ μƒˆλ‘œμš΄ λͺ¨λΈμ˜ ν•™μŠ΅κ³Ό 검증을 μœ„ν•˜μ—¬ λ¬΄μž‘μœ„λ‘œ 일뢀 ν”„λ ˆμž„μ΄ 제거된 λ‘œλ΄‡ μ›€μ§μž„ μ˜μƒμ„ 기반으둜 μƒμ„±λœ μ˜μƒκ³Ό 원본 μ˜μƒμ„ PSNR둜 비ꡐ λΆ„μ„ν•˜μ˜€λ‹€. κ·Έ κ²°κ³Ό, μ œμ•ˆν•œ ν”„λ ˆμž„ 예츑 λͺ¨λΈμ˜ PSNR은 35.16으둜 λΉ„κ΅ν•œ 3개의 λ‹€λ₯Έ λͺ¨λΈμ— λΉ„ν•΄ μ΅œλŒ€ 14.06이 ν–₯μƒλ˜μ—ˆλ‹€. λ˜ν•œ μƒμ„±λœ ν”„λ ˆμž„μ— λ”°λ₯Έ PSNR의 κ°μ†Œλ„ 4번째 ν”„λ ˆμž„ μ΄μ „μ—λŠ” 2, μ΄ν›„μ—λŠ” 7둜 평균 5κ°€ κ°œμ„ λ˜μ—ˆλ‹€.|Frame prediction, which is a technique to reconstruct frames lost due to damage or to generate new consecutive frames in the video, is attracting attention as a main technology which is indispensable for the autonomous vehicle and the artificial intelligence based security system that require motion prediction of objects. Recently, this technology has improved prediction accuracy in combination with deep learning technology, but it is difficulties in practical application because it involves a lot of learning data and computation amount. The existing deep learning based prediction model, since the frame generated by the prediction is feedback in the new frame generation process, is decreased the prediction accuracy over time. Therefore, in this paper, we propose an operation probability map based new frame prediction model using convolution neural network (CNN), long short-term, memory (LSTM) and deconvolution neural network(DNN) to minimize unnecessary computation regions in the frame and prediction error. The proposed model extracts the operating characteristics of the frames through CNN and LSTM and learns the patterns to generate the operation probability map. Through this process, a region in which an operation occurs is determined in one frame, and a new frame is obtained through DNN only in this region. At this time, the generative adversarial nets(GAN) technique is applied for efficient learning of DNN with the high learning complexity. For the learning and verification of the proposed new model, we compared and analyzed the generated frame and the original frame based on robotic motion images with some frames removed randomly using PSNR. 
As a result, the PSNR of the proposed frame prediction model is 35.16, up to 14.06 higher than the three models compared against. Moreover, the PSNR drop per generated frame is 2 before the fourth frame and 7 thereafter, an improvement of 5 on average.
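
    A minimal PyTorch sketch of the described pipeline is given below: a CNN extracts per-frame features, an LSTM runs over the frame sequence at each spatial location, and a sigmoid head yields the operation probability map; thresholding this map would select the regions handed to the generator. All layer sizes are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

class OperationProbabilityMap(nn.Module):
    """Sketch of the abstract's first stage: CNN features per frame, an
    LSTM over the sequence to learn motion patterns, and a per-pixel
    probability that motion occurs at that location."""
    def __init__(self, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Conv2d(hidden, 1, 1)

    def forward(self, frames):                   # frames: (B, T, 1, H, W)
        B, T, C, H, W = frames.shape
        feats = self.cnn(frames.reshape(B * T, C, H, W))   # (B*T, 32, h, w)
        _, _, h, w = feats.shape
        # Run the LSTM independently at every spatial location.
        seq = feats.reshape(B, T, 32, h * w).permute(0, 3, 1, 2)
        out, _ = self.lstm(seq.reshape(B * h * w, T, 32))
        last = out[:, -1].reshape(B, h, w, -1).permute(0, 3, 1, 2)
        return torch.sigmoid(self.head(last))    # motion probability map

# Only regions where the map exceeds a threshold would be passed on to
# the generator (the DNN) for synthesis of the next frame.
model = OperationProbabilityMap()
prob = model(torch.randn(2, 5, 1, 64, 64))       # (2, 1, 16, 16)
```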

    Investigating the latency cost of statistical learning of a Gaussian mixture simulating on a convolutional density network with adaptive batch size technique for background modeling

    Background modeling is a promising field of study in video analysis, with a wide range of applications in video surveillance. Deep neural networks have proliferated in recent years as a result of effective learning-based approaches to motion analysis. However, these strategies only provide a partial description of the observed scenes' properties, since they use a single-valued mapping to estimate the target background's temporal conditional averages. On the other hand, statistical learning in the imagery domain has become one of the most widely used approaches due to its high adaptability to dynamic context transformation, especially Gaussian mixture models. These probabilistic models adjust latent parameters to maximize the expectation of the realistically observed data; however, this approach concentrates on contextual dynamics only in short-term analysis. Over a prolonged investigation, statistical methods struggle to preserve a generalization of the long-term variation of image data. Balancing the trade-off between traditional machine learning models and deep neural networks requires an integrated approach that ensures accuracy while maintaining a high speed of execution. In this research, we present a novel two-stage approach for detecting changes using two convolutional neural networks. The first architecture is based on unsupervised statistical learning of a Gaussian mixture, which is used to classify the salient features of scenes. The second implements a lightweight pipeline for foreground detection. Our two-stage system has a total of approximately 3.5K parameters but still converges quickly to complex motion patterns. Our experiments on publicly accessible datasets demonstrate that the proposed networks not only generalize regions of moving objects with promising results in unseen scenarios, but are also competitive in terms of performance quality and effectiveness of foreground segmentation. Apart from modeling the data's underlying generator as a non-convex optimization problem, we briefly examine the communication cost associated with network training by using a distributed data-parallelism scheme to simulate a stochastic gradient descent algorithm with communication avoidance for parallel machine learning.
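
    For reference, the statistical model that the first-stage network learns to emulate is the classic per-pixel Gaussian mixture. Below is a minimal NumPy sketch of one online update step in the Stauffer-Grimson style, with a simplified ownership term and hypothetical default parameters.

```python
import numpy as np

def gmm_pixel_update(x, w, mu, var, alpha=0.01, match_sigma=2.5):
    """One online update of a per-pixel Gaussian mixture for a grayscale
    pixel value x. Arrays w, mu, var hold the K component weights, means
    and variances; alpha is the learning rate."""
    d2 = (x - mu) ** 2 / var
    matched = d2 < match_sigma ** 2
    if matched.any():
        k = np.argmin(np.where(matched, d2, np.inf))   # closest match
        rho = alpha  # simplified ownership; full model scales by N(x|k)
        mu[k] += rho * (x - mu[k])
        var[k] += rho * ((x - mu[k]) ** 2 - var[k])
        w[:] = (1 - alpha) * w
        w[k] += alpha
    else:
        # Replace the least probable component with one centred on x.
        k = np.argmin(w)
        mu[k], var[k], w[k] = x, 30.0 ** 2, alpha
    w /= w.sum()
    return w, mu, var
```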

    Adaptive Nonlocal Signal Restoration and Enhancement Techniques for High-Dimensional Data

    The large number of practical applications involving digital images has motivated a significant interest towards restoration solutions that improve the visual quality of the data under the presence of various acquisition and compression artifacts. Digital images are the results of an acquisition process based on the measurement of a physical quantity of interest incident upon an imaging sensor over a specified period of time. The quantity of interest depends on the targeted imaging application. Common imaging sensors measure the number of photons impinging over a dense grid of photodetectors in order to produce an image similar to what is perceived by the human visual system. Different applications focus on the part of the electromagnetic spectrum not visible to the human visual system, and thus require different sensing technologies to form the image. In all cases, even with the advance of technology, raw data is invariably affected by a variety of inherent and external disturbing factors, such as the stochastic nature of the measurement processes or challenging sensing conditions, which may cause, e.g., noise, blur, geometrical distortion and color aberration. In this thesis we introduce two filtering frameworks for video and volumetric data restoration based on the BM3D grouping and collaborative filtering paradigm. In its general form, the BM3D paradigm leverages the correlation present within a nonlocal group composed of mutually similar basic filtering elements, e.g., patches, to attain an enhanced sparse representation of the group in a suitable transform domain, where the energy of the meaningful part of the signal can thus be separated from that of the noise through coefficient shrinkage. We argue that the success of this approach largely depends on the form of the basic filtering elements used, which in turn defines the subsequent spectral representation of the nonlocal group. Thus, the main contribution of this thesis consists in tailoring specific basic filtering elements to the inherent characteristics of the processed data at hand. Specifically, we embed the local spatial correlation present in volumetric data through 3-D cubes, and the local spatial and temporal correlation present in videos through 3-D spatiotemporal volumes, i.e., sequences of 2-D blocks following a motion trajectory. The foundational aspect of this work is the analysis of the particular spectral representation of these elements. Specifically, our frameworks stack mutually similar 3-D patches along an additional fourth dimension, thus forming a 4-D data structure. By doing so, an effective group spectral description can be formed, as the phenomena acting along different dimensions in the data can be precisely localized along different spectral hyperplanes, and thus different filtering shrinkage strategies can be applied to different spectral coefficients to achieve the desired filtering results. This constitutes a decisive difference from the shrinkage traditionally employed in BM3D algorithms, where different hyperplanes of the group spectrum are shrunk subject to the same degradation model. Different image processing problems rely on different observation models and typically require specific algorithms to filter the corrupted data.
As a consequent contribution of this thesis, we show that our high-dimensional filtering model allows us to target heterogeneous noise models, e.g., characterized by spatial and temporal correlation, signal-dependent distributions, spatially varying statistics, and non-white power spectral densities, without essential modifications to the algorithm structure. As a result, we develop state-of-the-art methods for a variety of fundamental image processing problems, such as denoising, deblocking, enhancement, deflickering, and reconstruction, which also find practical applications in consumer, medical, and thermal imaging.
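
    The core grouping-and-shrinkage step can be illustrated compactly. The sketch below stacks mutually similar 3-D volumes into a 4-D group and hard-thresholds its separable DCT spectrum; the threshold factor and the use of a single global shrinkage rule (rather than the hyperplane-specific strategies argued for above) are simplifying assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def collaborative_filter_4d(group, sigma, lam=2.7):
    """Hard-threshold shrinkage of a 4-D group in a separable DCT domain.
    'group' stacks K mutually similar 3-D spatiotemporal volumes along
    the fourth axis; sigma is the noise standard deviation."""
    spectrum = dctn(group, norm='ortho')        # 4-D decorrelating transform
    # A full framework would shrink different spectral hyperplanes with
    # different strategies; one global threshold is used here for brevity.
    spectrum[np.abs(spectrum) < lam * sigma] = 0.0
    return idctn(spectrum, norm='ortho')

# Usage: a toy "group" of 8 similar 4x8x8 volumes corrupted by noise.
rng = np.random.default_rng(0)
group = rng.normal(size=(8, 4, 8, 8))
filtered = collaborative_filter_4d(group, sigma=1.0)
```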

    Exploiting Spatio-Temporal Coherence for Video Object Detection in Robotics

    This paper proposes a method to enhance video object detection for indoor environments in robotics. Concretely, it exploits knowledge about the camera motion between frames to propagate previously detected objects to successive frames. The proposal is rooted in the concepts of planar homography, used to propose regions of interest where objects may be found, and recursive Bayesian filtering, used to integrate observations over time. The proposal is evaluated on six virtual indoor environments, accounting for the detection of nine object classes over a total of ∼7k frames. Results show that our proposal improves the recall and the F1-score by factors of 1.41 and 1.27, respectively, and achieves a significant reduction (58.8%) of the object categorization entropy when compared to a two-stage video object detection method used as baseline, at the cost of a small time overhead (120 ms) and a precision loss (0.92).
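
    A minimal sketch of the two ingredients, assuming OpenCV and a known inter-frame homography: detected boxes are warped into the next frame to form regions of interest, and per-class probabilities are fused over time with a recursive Bayesian update.

```python
import numpy as np
import cv2

def propagate_detection(box, H):
    """Warp a detection's corners from frame t-1 to frame t with the
    3x3 planar homography H, yielding an axis-aligned region of
    interest in which to look for the object again.
    box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    corners = np.float32([[x0, y0], [x1, y0],
                          [x1, y1], [x0, y1]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    return np.concatenate([warped.min(0), warped.max(0)])  # new ROI

def bayes_update(prior, likelihood):
    """Recursive Bayesian update of the per-class probabilities of a
    tracked object: posterior proportional to prior x likelihood."""
    post = prior * likelihood
    return post / post.sum()
```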

    Single view image 3D geometry estimation using self-supervised machine learning

    Recovering 3D information from 2D RGB images is an essential task for many applications such as autonomous driving, robotics, and augmented reality. Specifically, estimating depth information, which is lost during image formation, is a vital step for downstream tasks. With the development of deep learning, especially supervised learning, more and more researchers exploit this technique to improve depth estimation. However, the performance of supervised learning models relies heavily on the quality of depth ground truth, which is expensive to collect. In contrast, self-supervised approaches, built on well-established Structure-from-Motion principles, only require sequential images to train depth estimation models, transforming a depth regression task into an image reconstruction task. In this thesis, we focus on improving self-supervised monocular depth estimation. To this end, we propose several approaches: firstly, we explore temporal geometry consistencies across consecutive frames and propose a depth loss and a pose loss; secondly, we adopt HRNet and an attention mechanism to build DIFFNet, a novel representation network architecture that benefits significantly from higher-resolution input images; thirdly, we propose a two-stage training scheme on top of the existing one-stage framework, introducing a second training stage in which a self-distillation loss is optimized alongside the photometric loss. All of these works have been published at conferences.
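
    The image reconstruction task at the core of such self-supervised training can be sketched as differentiable inverse warping: target pixels are back-projected with the predicted depth, transformed by the predicted relative pose, and sampled from the source image, so that the photometric error supervises both networks. The PyTorch sketch below uses illustrative shapes and a plain L1 error; published methods typically combine L1 with SSIM and add masking.

```python
import torch
import torch.nn.functional as F

def inverse_warp(src, depth, K, K_inv, T):
    """Synthesize the target view from a source image.
    src: (B,3,H,W) source image, depth: (B,1,H,W) predicted target depth,
    K/K_inv: (B,3,3) camera intrinsics and inverse, T: (B,4,4) relative
    pose from target to source."""
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()   # (3,H,W)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)                 # (B,3,HW)
    cam = K_inv @ pix * depth.reshape(B, 1, -1)       # back-project to 3D
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], 1)   # homogeneous
    proj = K @ (T @ cam_h)[:, :3]                     # project into source
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], -1).reshape(B, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)

# Photometric training signal (L1 only in this sketch):
# loss = (inverse_warp(src, depth, K, K_inv, T) - tgt).abs().mean()
```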