2 research outputs found

    A Study on Frame Prediction Method based on Operation Probability Map

    λ™μ˜μƒλ‚΄μ—μ„œ 손상에 μ˜ν•΄ μ†Œμ‹€λœ ν”„λ ˆμž„μ„ λ³΅μ›ν•˜κ±°λ‚˜ 연속적인 μƒˆλ‘œμš΄ ν”„λ ˆμž„μ„ μƒμ„±ν•˜λŠ” 기법인 ν”„λ ˆμž„ μ˜ˆμΈ‘μ€ κ°μ²΄λ“€μ˜ λ™μž‘ 예츑이 ν•„μš”ν•œ μžμœ¨μ£Όν–‰, λ³΄μ•ˆ λ“±μ˜ 미래 μ£Όμš” κΈ°μˆ λ‘œμ„œ μ£Όλͺ©λ°›κ³  μžˆλ‹€. 졜근 이 κΈ°μˆ μ€ λ”₯λŸ¬λ‹ 기술과 κ²°ν•©ν•˜μ—¬ 예츑 정확도가 많이 ν–₯μƒλ˜κ³  μžˆμœΌλ‚˜ λ§Žμ€ ν•™μŠ΅λ°μ΄ν„°μ™€ μ—°μ‚°λŸ‰μ΄ 수반되기 λ•Œλ¬Έμ— μ‹€μ§ˆμ μΈ μ μš©μ—λŠ” 어렀움이 μ‘΄μž¬ν•œλ‹€. 기쑴의 λ”₯λŸ¬λ‹ 기반 예츑 λͺ¨λΈμ€ μƒˆλ‘œμš΄ ν”„λ ˆμž„ 생성 κ³Όμ •μ—μ„œ μ˜ˆμΈ‘μ— μ˜ν•΄ μƒμ„±λœ ν”„λ ˆμž„μ„ ν”Όλ“œλ°±ν•˜κΈ° λ•Œλ¬Έμ— λˆ„μ μ˜€μ°¨κ°€ 많이 λ°œμƒν•˜μ—¬ μ‹œκ°„μ΄ 지남에 따라 예츑 정확도가 κ°μ†Œν•œλ‹€. λ”°λΌμ„œ λ³Έ λ…Όλ¬Έμ—μ„œλŠ” convolution neural network (CNN)와 long short-term memory (LSTM)으둜 κ΅¬μ„±λœ λ„€νŠΈμ›Œν¬λ₯Ό 톡해 ν”„λ ˆμž„λ“€μ˜ λ™μž‘ νŠΉμ§•λ“€μ„ μΆ”μΆœν•˜κ³  νŒ¨ν„΄μ„ ν•™μŠ΅ν•˜μ—¬ λ™μž‘ ν™•λ₯  지도λ₯Ό μƒμ„±ν•˜μ—¬ μ›€μ§μž„μ΄ λ°œμƒν•œ μ˜μ—­μ— λŒ€ν•˜μ—¬ deconvolution neural network(DNN)λ₯Ό 톡해 이후 ν”„λ ˆμž„μ„ μƒμ„±ν•˜λŠ” μƒˆλ‘œμš΄ ν”„λ ˆμž„ 예츑 λͺ¨λΈμ„ μ œμ•ˆν•œλ‹€. μ œμ•ˆν•œ λͺ¨λΈμ€ CNNκ³Ό LSTM을 톡해 ν”„λ ˆμž„λ“€μ˜ λ™μž‘ νŠΉμ§•λ“€μ„ μΆ”μΆœν•˜κ³  νŒ¨ν„΄μ„ ν•™μŠ΅ν•˜μ—¬ λ™μž‘ ν™•λ₯  지도λ₯Ό μƒμ„±ν•œλ‹€. 이λ₯Ό 톡해 μž„μ˜μ˜ ν•œ ν”„λ ˆμž„μ—μ„œ λ™μž‘μ΄ λ°œμƒν•˜λŠ” μ˜μ—­λ₯Ό νŒλ³„ν•˜κ³  이 μ˜μ—­λ§Œ DNN을 톡해 μƒˆλ‘œμš΄ ν”„λ ˆμž„μ„ νšλ“ν•œλ‹€. μ΄λ•Œ ν•™μŠ΅ λ‚œμ΄λ„κ°€ 높은 DNN의 효율적인 ν•™μŠ΅μ„ μœ„ν•΄ generative adversarial nets(GAN) 기법을 μ μš©ν•œλ‹€. μ œμ•ˆλœ μƒˆλ‘œμš΄ λͺ¨λΈμ˜ ν•™μŠ΅κ³Ό 검증을 μœ„ν•˜μ—¬ λ¬΄μž‘μœ„λ‘œ 일뢀 ν”„λ ˆμž„μ΄ 제거된 λ‘œλ΄‡ μ›€μ§μž„ μ˜μƒμ„ 기반으둜 μƒμ„±λœ μ˜μƒκ³Ό 원본 μ˜μƒμ„ PSNR둜 비ꡐ λΆ„μ„ν•˜μ˜€λ‹€. κ·Έ κ²°κ³Ό, μ œμ•ˆν•œ ν”„λ ˆμž„ 예츑 λͺ¨λΈμ˜ PSNR은 35.16으둜 λΉ„κ΅ν•œ 3개의 λ‹€λ₯Έ λͺ¨λΈμ— λΉ„ν•΄ μ΅œλŒ€ 14.06이 ν–₯μƒλ˜μ—ˆλ‹€. λ˜ν•œ μƒμ„±λœ ν”„λ ˆμž„μ— λ”°λ₯Έ PSNR의 κ°μ†Œλ„ 4번째 ν”„λ ˆμž„ μ΄μ „μ—λŠ” 2, μ΄ν›„μ—λŠ” 7둜 평균 5κ°€ κ°œμ„ λ˜μ—ˆλ‹€.|Frame prediction, which is a technique to reconstruct frames lost due to damage or to generate new consecutive frames in the video, is attracting attention as a main technology which is indispensable for the autonomous vehicle and the artificial intelligence based security system that require motion prediction of objects. Recently, this technology has improved prediction accuracy in combination with deep learning technology, but it is difficulties in practical application because it involves a lot of learning data and computation amount. The existing deep learning based prediction model, since the frame generated by the prediction is feedback in the new frame generation process, is decreased the prediction accuracy over time. Therefore, in this paper, we propose an operation probability map based new frame prediction model using convolution neural network (CNN), long short-term, memory (LSTM) and deconvolution neural network(DNN) to minimize unnecessary computation regions in the frame and prediction error. The proposed model extracts the operating characteristics of the frames through CNN and LSTM and learns the patterns to generate the operation probability map. Through this process, a region in which an operation occurs is determined in one frame, and a new frame is obtained through DNN only in this region. At this time, the generative adversarial nets(GAN) technique is applied for efficient learning of DNN with the high learning complexity. For the learning and verification of the proposed new model, we compared and analyzed the generated frame and the original frame based on robotic motion images with some frames removed randomly using PSNR. 
    As a result, the PSNR of the proposed frame prediction model was 35.16, up to 14.06 higher than that of the three compared models. In addition, the decrease in PSNR over successively generated frames was reduced by 2 before the fourth frame and by 7 thereafter, an improvement of 5 on average.

    Table of contents:
    Chapter 1 Introduction
    Chapter 2 Related Works
        2.1 Convolution Neural Network
        2.2 Long Short-Term Memory
        2.3 Generative Adversarial Nets
    Chapter 3 The Proposed Prediction Model
        3.1 Structure of the proposed model
        3.2 Model for feature extraction and operation probability estimation
        3.3 Model for generating and combining images
        3.4 Model for learning of generative model
    Chapter 4 Experiment and Result
        4.1 Dataset for learning and testing
        4.2 Analysis of experimental results
    Chapter 5 Conclusion
    Reference
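    The model described above chains a CNN feature extractor, an LSTM over the frame sequence, and a deconvolution (transposed-convolution) generator that is gated by the operation probability map so that only regions judged to contain motion are synthesized. The following is a minimal PyTorch sketch of that pipeline; the layer sizes, the 64Γ—64 frame resolution, the 0.5 masking threshold, and all class and function names are illustrative assumptions rather than the thesis's actual architecture.

```python
# Minimal sketch of the CNN + LSTM operation-probability-map pipeline (assumed shapes).
import torch
import torch.nn as nn

class OperationProbabilityMap(nn.Module):
    """CNN per-frame features + LSTM over time -> per-pixel motion probability map."""
    def __init__(self, hidden=256, h=64, w=64):
        super().__init__()
        self.h, self.w = h, w
        self.encoder = nn.Sequential(                      # CNN: per-frame motion features
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32 * 8 * 8, hidden, batch_first=True)  # temporal patterns
        self.to_map = nn.Linear(hidden, h * w)             # project to a probability map

    def forward(self, clip):                               # clip: (B, T, 1, H, W)
        B, T = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(feats)
        prob = torch.sigmoid(self.to_map(out[:, -1]))      # motion probability per pixel
        return prob.view(B, 1, self.h, self.w)

class FrameGenerator(nn.Module):
    """Transposed-convolution (DNN) generator applied only where motion is likely."""
    def __init__(self):
        super().__init__()
        self.down = nn.AvgPool2d(4)                        # match the 4x upsampling below
        self.deconv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, last_frame, prob_map):               # last_frame: (B, 1, 64, 64)
        mask = (prob_map > 0.5).float()                    # assumed motion threshold
        generated = self.deconv(self.down(last_frame))
        # keep static regions from the last observed frame, synthesize only moving ones
        return mask * generated + (1.0 - mask) * last_frame

# Usage sketch: predict the next frame from a clip of eight observed 64x64 frames.
clip = torch.rand(2, 8, 1, 64, 64)
prob = OperationProbabilityMap()(clip)
nxt = FrameGenerator()(clip[:, -1], prob)                  # (2, 1, 64, 64)
```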
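    The abstract states only that the GAN technique is used to train the high-complexity DNN generator efficiently. A minimal adversarial training step under that reading might look as follows; the patch-style discriminator, the optimizers, and the added L1 reconstruction term are illustrative assumptions, not details given in the thesis.

```python
# Minimal GAN training step for the frame generator (assumed discriminator and losses).
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(                                  # real vs. generated 64x64 frame
    nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.Linear(32 * 16 * 16, 1),
)
bce = nn.BCEWithLogitsLoss()

def gan_step(generator, frame_real, last_frame, prob_map, opt_g, opt_d):
    """One adversarial update; `generator` maps (last_frame, prob_map) -> frame."""
    frame_fake = generator(last_frame, prob_map)

    # 1) discriminator: push real frames toward 1 and generated frames toward 0
    opt_d.zero_grad()
    d_real, d_fake = disc(frame_real), disc(frame_fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()

    # 2) generator: fool the discriminator; the L1 term (an assumption) keeps the
    #    generated frame close to the ground-truth frame
    opt_g.zero_grad()
    d_fake = disc(frame_fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + F.l1_loss(frame_fake, frame_real)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```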
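    The evaluation compares generated frames against the originals with PSNR. For reference, the standard PSNR computation, assuming 8-bit frames so that the peak value is 255, is:

```python
# PSNR between an original and a generated frame, in dB (8-bit pixel range assumed).
import numpy as np

def psnr(original: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((original.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                    # identical frames
    return 10.0 * np.log10((max_val ** 2) / mse)

# e.g. averaging over a reconstructed clip:
# scores = [psnr(o, g) for o, g in zip(original_frames, generated_frames)]
```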

    Novel Motion Anchoring Strategies for Wavelet-based Highly Scalable Video Compression

    This thesis investigates new motion anchoring strategies that are targeted at wavelet-based highly scalable video compression (WSVC). We depart from two practices that are deeply ingrained in existing video compression systems. Instead of the commonly used block motion, which has poor scalability attributes, we employ piecewise-smooth motion together with a highly scalable motion boundary description. Combining this more β€œphysical” motion description with motion discontinuity information allows us to change the conventional strategy of anchoring motion at target frames to anchoring motion at reference frames, which improves motion inference across time.

    In the proposed reference-based motion anchoring strategies, motion fields are mapped from reference to target frames, where they serve as prediction references; during this mapping process, disoccluded regions are readily discovered. Observing that motion discontinuities displace with foreground objects, we propose motion-discontinuity-driven motion mapping operations that handle traditionally challenging regions around moving objects.

    The reference-based motion anchoring exposes an intricate connection between temporal frame interpolation (TFI) and video compression. When employed in a compression system, all anchoring strategies explored in this thesis perform TFI once all residual information is quantized to zero at a given temporal level. The interpolation performance is evaluated on both natural and synthetic sequences, where we show favourable comparisons with state-of-the-art TFI schemes.

    We explore three reference-based motion anchoring strategies. In the first one, the motion anchoring is β€œflipped” with respect to a hierarchical B-frame structure. We develop an analytical model to determine the weights of the different spatio-temporal subbands, and assess the suitability and benefits of this reference-based WSVC for (highly scalable) video compression. Reduced motion coding cost and improved frame prediction, especially around moving objects, result in improved rate-distortion performance compared to a target-based WSVC. As the thesis evolves, the motion anchoring is progressively simplified to one where all motion is anchored at one base frame; this central motion organization facilitates the incorporation of higher-order motion models, which improve the prediction performance in regions following motion with non-constant velocity.
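    To illustrate the reference-anchored mapping described above, the toy sketch below forward-maps a motion field stored at the reference frame onto the target frame and flags target pixels that receive no vector as disoccluded. The dense per-pixel field, the nearest-pixel rounding, and the function name are simplifying assumptions; the thesis uses piecewise-smooth motion with a scalable motion boundary description rather than a dense field.

```python
# Toy reference-anchored motion mapping with disocclusion detection (NumPy).
import numpy as np

def map_motion_to_target(motion_ref, shape):
    """motion_ref: (H, W, 2) displacements (dy, dx) anchored at the reference frame.
    Returns the field re-anchored at the target frame and a boolean mask of target
    pixels that no reference pixel maps to (i.e. disoccluded regions)."""
    H, W = shape
    motion_tgt = np.zeros((H, W, 2))
    hit = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            dy, dx = motion_ref[y, x]
            ty, tx = int(round(y + dy)), int(round(x + dx))   # landing spot in the target
            if 0 <= ty < H and 0 <= tx < W:
                motion_tgt[ty, tx] = (-dy, -dx)   # target->reference vector for prediction
                hit[ty, tx] = True
    return motion_tgt, ~hit                       # unmapped pixels = disoccluded regions

# Usage: a uniform shift of +2 pixels in x leaves a 2-pixel disoccluded band at the left.
field = np.zeros((8, 8, 2)); field[..., 1] = 2.0
mapped, disoccluded = map_motion_to_target(field, (8, 8))
```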