
    Enhanced Intra Prediction for Video Coding by Using Multiple Neural Networks

    This paper enhances intra prediction by using multiple neural network modes (NMs). Each NM serves as an end-to-end mapping from the neighboring reference blocks to the current coding block. For the provided NMs, we present two schemes (appending and substitution) to integrate the NMs with the traditional modes (TMs) defined in High Efficiency Video Coding (HEVC). In the appending scheme, each NM corresponds to a certain range of TMs, where the categorization of TMs is based on the expected prediction errors. After determining the relevant TMs for each NM, we present a probability-aware mode signaling scheme: the NMs with higher probabilities of being the best mode are signaled with fewer bits. In the substitution scheme, we propose to replace the most and least probable TMs, and a new most probable mode (MPM) generation method is employed when substituting the least probable TMs. Experimental results demonstrate that using multiple NMs clearly improves coding efficiency compared with a single NM. Specifically, the proposed appending scheme with seven NMs saves 2.6%, 3.8%, and 3.1% BD-rate for the Y, U, and V components, respectively, compared with the single-NM state-of-the-art works. Comment: Accepted to IEEE Transactions on Multimedia
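
    As a rough, hedged illustration of what one neural network mode could look like, the sketch below (Python/PyTorch) maps flattened neighboring reference samples to an N x N block prediction through a small fully connected network. The reference layout, layer sizes, and names are illustrative assumptions, not the paper's architecture.

        import torch
        import torch.nn as nn

        class NeuralIntraMode(nn.Module):
            """One NM as an end-to-end mapping: reference samples -> block prediction."""
            def __init__(self, block_size=8, ref_lines=2):
                super().__init__()
                # Assumed references: ref_lines rows above (width 2N) plus
                # ref_lines columns to the left (height 2N).
                n_ref = ref_lines * (2 * block_size) * 2
                self.net = nn.Sequential(
                    nn.Linear(n_ref, 512), nn.ReLU(),
                    nn.Linear(512, 512), nn.ReLU(),
                    nn.Linear(512, block_size * block_size),
                )
                self.block_size = block_size

            def forward(self, ref_samples):          # (batch, n_ref)
                pred = self.net(ref_samples)
                return pred.view(-1, self.block_size, self.block_size)

        # Usage: predict an 8x8 block from 64 placeholder reference samples.
        model = NeuralIntraMode(block_size=8, ref_lines=2)
        refs = torch.rand(1, 2 * 16 * 2)
        block_pred = model(refs)                     # shape (1, 8, 8)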

    Deep Learning-Based Video Coding: A Review and A Case Study

    The past decade has witnessed great success of deep learning technology in many disciplines, especially in computer vision and image processing. However, deep learning-based video coding remains in its infancy. This paper reviews the representative works on using deep learning for image/video coding, which has been an actively developing research area since 2015. We divide the related works into two categories: new coding schemes that are built primarily upon deep networks (deep schemes), and deep network-based coding tools (deep tools) that are used within traditional coding schemes or together with traditional coding tools. For deep schemes, pixel probability modeling and auto-encoders are the two main approaches, which can be viewed as predictive coding and transform coding schemes, respectively. For deep tools, several techniques have been proposed that use deep learning to perform intra-picture prediction, inter-picture prediction, cross-channel prediction, probability distribution prediction, transform, post- or in-loop filtering, down- and up-sampling, as well as encoding optimizations. In the hope of advocating research on deep learning-based video coding, we present a case study of our developed prototype video codec, namely Deep Learning Video Coding (DLVC). DLVC features two deep tools that are both based on convolutional neural networks (CNNs), namely a CNN-based in-loop filter (CNN-ILF) and CNN-based block adaptive resolution coding (CNN-BARC). Both tools improve the compression efficiency by a significant margin. With the two deep tools as well as other non-deep coding tools, DLVC achieves on average 39.6% and 33.0% bit savings compared with HEVC, under random-access and low-delay configurations, respectively. The source code of DLVC has been released for future research.
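
    The sketch below illustrates the general shape of a CNN-based in-loop filter in the spirit of CNN-ILF: a plain residual CNN that predicts a correction to the reconstructed frame. The depth, channel count, and names are assumptions chosen for brevity; DLVC's actual design differs.

        import torch
        import torch.nn as nn

        class InLoopFilterCNN(nn.Module):
            def __init__(self, channels=64, depth=8):
                super().__init__()
                layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
                for _ in range(depth - 2):
                    layers += [nn.Conv2d(channels, channels, 3, padding=1),
                               nn.ReLU(inplace=True)]
                layers += [nn.Conv2d(channels, 1, 3, padding=1)]
                self.body = nn.Sequential(*layers)

            def forward(self, rec):
                # Global residual connection: the network only learns the correction
                # to the reconstructed (compressed) frame.
                return rec + self.body(rec)

        filt = InLoopFilterCNN()
        rec_luma = torch.rand(1, 1, 64, 64)   # stand-in reconstructed luma patch
        filtered = filt(rec_luma)             # same shape, filtered in-loop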

    Steered mixture-of-experts for light field images and video: representation and coding

    Research in light field (LF) processing has increased considerably over the last decade. This is largely driven by the desire to achieve the same level of immersion and navigational freedom for camera-captured scenes as is currently available for CGI content. Standardization organizations such as MPEG and JPEG continue to follow conventional coding paradigms in which viewpoints are discretely represented on 2-D regular grids. These grids are then further decorrelated through hybrid DPCM/transform techniques. However, such 2-D regular grids are less suited for high-dimensional data such as LFs. We propose a novel coding framework for higher-dimensional image modalities, called Steered Mixture-of-Experts (SMoE). Coherent areas in the higher-dimensional space are represented by single higher-dimensional entities, called kernels. These kernels hold spatially localized information about light rays at any angle arriving at a certain region. The global model thus consists of a set of kernels that define a continuous approximation of the underlying plenoptic function. We introduce the theory of SMoE and illustrate its application to 2-D images, 4-D LF images, and 5-D LF video. We also propose an efficient coding strategy to convert the model parameters into a bitstream. Even without provisions for high-frequency information, the proposed method performs comparably to the state of the art for low-to-mid-range bitrates with respect to the subjective visual quality of 4-D LF images. In the case of 5-D LF video, we observe superior decorrelation and coding performance, with coding gains of a factor of 4x in bitrate at the same quality. At least equally important is the fact that our method inherently provides desired functionality for LF rendering which is lacking in other state-of-the-art techniques: (1) full zero-delay random access, (2) light-weight pixel-parallel view reconstruction, and (3) intrinsic view interpolation and super-resolution.
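
    To make the kernel idea concrete, the minimal sketch below reconstructs a 2-D gray patch from a mixture of Gaussian kernels with soft gating, assuming constant per-kernel experts. Actual SMoE models use steered (anisotropic) kernels in up to five dimensions; this is only a single-image illustration.

        import numpy as np

        def smoe_reconstruct(coords, means, covs, weights, expert_values):
            """coords: (P,2) pixel positions; means: (K,2); covs: (K,2,2);
            weights: (K,) mixing priors; expert_values: (K,) kernel intensity."""
            K = means.shape[0]
            resp = np.empty((coords.shape[0], K))
            for k in range(K):
                d = coords - means[k]
                inv = np.linalg.inv(covs[k])
                # Unnormalized Gaussian response of kernel k at every pixel.
                resp[:, k] = weights[k] * np.exp(
                    -0.5 * np.einsum('pi,ij,pj->p', d, inv, d))
            gating = resp / resp.sum(axis=1, keepdims=True)  # soft assignment
            return gating @ expert_values                    # gated blend of experts

        # Usage: two kernels splitting a 4x4 patch into dark-left/bright-right.
        ys, xs = np.mgrid[0:4, 0:4]
        coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
        means = np.array([[0.5, 1.5], [3.0, 1.5]])
        covs = np.stack([np.eye(2), np.eye(2)])
        img = smoe_reconstruct(coords, means, covs, np.ones(2),
                               np.array([0.1, 0.9]))
        print(img.reshape(4, 4).round(2))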

    Enhancing Quality for HEVC Compressed Videos

    The latest High Efficiency Video Coding (HEVC) standard has been increasingly applied to generate video streams over the Internet. However, HEVC-compressed videos may incur severe quality degradation, particularly at low bit-rates. Thus, it is necessary to enhance the visual quality of HEVC videos at the decoder side. To this end, this paper proposes a Quality Enhancement Convolutional Neural Network (QE-CNN) method that does not require any modification of the encoder to achieve quality enhancement for HEVC. In particular, our QE-CNN method learns QE-CNN-I and QE-CNN-P models to reduce the distortion of HEVC I and P frames, respectively. The proposed method differs from existing CNN-based quality enhancement approaches, which only handle intra-coding distortion and are thus not suitable for P frames. Our experimental results validate that our QE-CNN method is effective in enhancing quality for both I and P frames of HEVC videos. To apply our QE-CNN method in time-constrained scenarios, we further propose a Time-constrained Quality Enhancement Optimization (TQEO) scheme. Our TQEO scheme controls the computational time of QE-CNN to meet a target while maximizing the quality enhancement. The experimental results demonstrate the effectiveness of our TQEO scheme in terms of time-control accuracy and quality enhancement under different time constraints. Finally, we design a prototype to implement our TQEO scheme in a real-time scenario. Comment: Submitted to IEEE T-CSVT
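
    The time-constrained idea can be illustrated with a simple greedy schedule: given estimated per-frame quality gains and CPU costs of running the enhancement network, pick frames by gain per second until the budget is spent. The greedy rule and all numbers below are illustrative assumptions, not the paper's TQEO algorithm.

        def schedule_enhancement(frames, time_budget):
            """frames: list of (frame_id, est_gain_db, est_cost_sec)."""
            chosen, spent = [], 0.0
            # Highest quality gain per unit of CPU time first (greedy knapsack).
            for fid, gain, cost in sorted(frames, key=lambda f: f[1] / f[2],
                                          reverse=True):
                if spent + cost <= time_budget:
                    chosen.append(fid)
                    spent += cost
            return chosen, spent

        # Placeholder per-frame estimates: (id, gain in dB, cost in seconds).
        frames = [(0, 0.40, 0.08), (1, 0.15, 0.05),
                  (2, 0.32, 0.06), (3, 0.10, 0.05)]
        picked, used = schedule_enhancement(frames, time_budget=0.15)
        print(picked, round(used, 2))   # frames enhanced within the 0.15 s budget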

    End to end Multi-Objective Optimisation of H.264 and HEVC Codecs

    All multimedia devices now incorporate video CODECs that comply with international video coding standards such as H.264/MPEG-4 AVC and the new High Efficiency Video Coding standard (HEVC), otherwise known as H.265. Although the standard CODECs have been designed to include algorithms with optimal efficiency, a large number of coding parameters can be used to fine-tune their operation within known constraints such as available computational power, bandwidth, and consumer QoS requirements. With so many parameters involved, determining which of them play a significant role in providing optimal quality of service within given constraints is a challenge in itself. How to select the values of those significant parameters so that the CODEC performs optimally under the given constraints is a further important question. This thesis proposes a framework that uses machine learning algorithms to model the performance of a video CODEC based on the significant coding parameters. Means of modelling both encoder and decoder performance are proposed. We define objective functions that model the performance-related properties of a CODEC, i.e., video quality, bit-rate, and CPU time, and show that these objective functions can be practically utilised in video encoder/decoder designs, in particular in their performance optimisation within given operational and practical constraints. A multi-objective optimisation framework based on genetic algorithms is thus proposed to optimise the performance of a video CODEC. The framework is designed to jointly minimise the CPU time and bit-rate and to maximise the quality of the compressed video stream. The thesis presents the use of this framework in the performance modelling and multi-objective optimisation of the most widely used video coding standard at present, H.264, and the latest video coding standard, H.265/HEVC. When a communication network is used to transmit video, performance-related parameters of the communication channel will impact the end-to-end performance of the video CODEC. Network delays and packet loss will degrade the quality of the video received at the decoder, i.e., even if a video CODEC is optimally configured, network conditions can make the experience sub-optimal. The thesis therefore proposes the design, integration, and testing of a novel approach to simulating a wired network, using the UDP protocol for the transmission of video data. This network is subsequently used to simulate the impact of packet loss and network delays on optimally coded video, based on the framework previously proposed for the modelling and optimisation of video CODECs. The quality of received video under different levels of packet loss and network delay is simulated, and conclusions are drawn about the impact on transmitted video based on its content and features.
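
    A toy version of the genetic parameter search is sketched below. For brevity it scalarizes the three objectives (quality, bit-rate, CPU time) with fixed weights instead of maintaining a Pareto front, and the parameter ranges and the evaluate() stand-in are placeholders; a real run would invoke the codec and measure.

        import random

        PARAM_SPACE = {"qp": range(20, 45), "ref_frames": range(1, 5),
                       "me_range": range(8, 65)}

        def evaluate(ind):
            # Placeholder objective model; a real run would encode and measure.
            time_c = ind["ref_frames"] * 1.0 + ind["me_range"] / 32.0
            bitrate = 5000.0 / ind["qp"]
            quality = 60.0 - 0.5 * ind["qp"] + ind["ref_frames"]
            return 0.4 * quality - 0.3 * bitrate / 100.0 - 0.3 * time_c

        def random_ind():
            return {k: random.choice(list(v)) for k, v in PARAM_SPACE.items()}

        def crossover(a, b):
            # Uniform crossover: each parameter inherited from either parent.
            return {k: random.choice([a[k], b[k]]) for k in PARAM_SPACE}

        def mutate(ind, rate=0.2):
            return {k: random.choice(list(PARAM_SPACE[k]))
                    if random.random() < rate else v
                    for k, v in ind.items()}

        pop = [random_ind() for _ in range(20)]
        for _ in range(30):                       # generations
            pop.sort(key=evaluate, reverse=True)
            elite = pop[:10]                      # truncation selection
            pop = elite + [mutate(crossover(*random.sample(elite, 2)))
                           for _ in range(10)]
        print(max(pop, key=evaluate))             # best parameter set found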

    Multi-View Surveillance Video Summarization via Joint Embedding and Sparse Optimization

    Most traditional video summarization methods are designed to generate effective summaries for single-view videos, and thus they cannot fully exploit the complicated intra- and inter-view correlations involved in summarizing multi-view videos in a camera network. In this paper, with the aim of summarizing multi-view videos, we introduce a novel unsupervised framework via joint embedding and sparse representative selection. The objective function is two-fold. The first part captures the multi-view correlations via an embedding, which helps in extracting a diverse set of representatives. The second uses an ℓ2,1-norm to model the sparsity while selecting representative shots for the summary. We propose to jointly optimize both objectives, such that the embedding can not only characterize the correlations, but also indicate the requirements of sparse representative selection. We present an efficient alternating algorithm based on half-quadratic minimization to solve the proposed non-smooth and non-convex objective, with convergence analysis. A key advantage of the proposed approach with respect to the state of the art is that it can summarize multi-view videos without assuming any prior correspondences/alignment between them, e.g., in uncalibrated camera networks. Rigorous experiments on several multi-view datasets demonstrate that our approach clearly outperforms the state-of-the-art methods. Comment: IEEE Trans. on Multimedia, 2017 (In Press)
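
    The sparsity term can be made concrete with a generic single-view sketch (the joint-embedding coupling is omitted): solving min_W ||X - XW||_F^2 + lam*||W||_{2,1} by iteratively reweighted least squares, a standard half-quadratic device. Rows of W with large norms mark representative shots.

        import numpy as np

        def l21_representatives(X, lam=1.0, iters=50, eps=1e-8):
            """X: (d, n) matrix whose columns are shot feature vectors."""
            n = X.shape[1]
            G = X.T @ X                                     # (n, n) Gram matrix
            W = np.eye(n)
            for _ in range(iters):
                row_norms = np.linalg.norm(W, axis=1)
                D = np.diag(1.0 / (2.0 * row_norms + eps))  # half-quadratic weights
                W = np.linalg.solve(G + lam * D, G)         # closed-form update
            return np.linalg.norm(W, axis=1)                # representativeness score

        rng = np.random.default_rng(0)
        X = rng.standard_normal((16, 10))       # stand-in features for 10 shots
        scores = l21_representatives(X, lam=5.0)
        print(np.argsort(scores)[::-1][:3])     # top-3 representative shot indices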

    Region-Based Rate-Control for H.264/AVC for Low Bit-Rate Applications

    Rate-control plays an important role in video coding. However, in conventional rate-control algorithms, the number and position of Macroblocks (MBs) inside one basic unit for rate-control are inflexible and predetermined, the different characteristics of the MBs are not fully considered, and there is no overall optimization of the coding of basic units. This paper proposes a new region-based rate-control scheme for H.264/AVC to improve coding efficiency. Inter-frame information is explored to objectively divide one frame into multiple regions based on their rate-distortion behaviors. MBs with similar characteristics are classified into the same region, and the entire region, instead of a single MB or a group of contiguous MBs, is treated as a basic unit for rate-control. A linear rate-quantization-stepsize model and a linear distortion-quantization-stepsize model are proposed to accurately describe the rate-distortion characteristics of the region-based basic units. Moreover, based on these linear models, an overall optimization model is proposed to obtain suitable Quantization Parameters (QPs) for the region-based basic units. Experimental results demonstrate that the proposed region-based rate-control approach achieves better subjective and objective quality than conventional rate-control approaches by adapting the rate-control to the content. Comment: This manuscript is the accepted version for IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
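
    A hedged sketch of the allocation step follows, assuming (per the abstract) that rate and distortion are roughly linear in the quantization stepsize q for each region: R_i(q) = a_i - b_i*q and D_i(q) = c_i + d_i*q. The coefficients would be fitted from previously coded frames; the numbers below are placeholders, and the greedy rule is one simple way to use such models, not the paper's optimization.

        import numpy as np

        def allocate_qsteps(a, b, c, d, rate_budget, q_min=8.0, q_max=64.0):
            q = np.full(len(a), q_max)                # start at the coarsest stepsize
            budget = rate_budget - np.sum(a - b * q)  # bits beyond the cheapest plan
            if budget <= 0:
                return q                              # budget too small to refine
            for i in np.argsort(d / b)[::-1]:         # best distortion drop per bit
                q_new = max(q_min, q[i] - budget / b[i])
                budget -= b[i] * (q[i] - q_new)       # bits spent lowering region i
                q[i] = q_new
                if budget <= 0:
                    break
            return q

        a = np.array([9000.0, 6000.0]); b = np.array([100.0, 60.0])  # R_i = a - b*q
        c = np.array([40.0, 30.0]);     d = np.array([0.9, 0.3])     # D_i = c + d*q
        q = allocate_qsteps(a, b, c, d, rate_budget=7000.0)
        print(q, np.sum(c + d * q))   # chosen stepsizes, modeled total distortion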

    Spec-ResNet: A General Audio Steganalysis scheme based on Deep Residual Network of Spectrogram

    The widespread application of audio and video communication technology has made compressed audio data ubiquitous on the Internet, turning it into an important carrier for covert communication. Many steganographic schemes have emerged for mainstream compressed audio formats such as AAC and MP3, followed by many steganalysis schemes. However, these steganalysis schemes are only effective in their specific embedding domains. In this paper, a general steganalysis scheme, Spec-ResNet (Deep Residual Network of Spectrogram), is proposed to detect steganography schemes of different embedding domains for AAC and MP3. The basic idea is that steganographic modifications in any embedding domain all introduce changes in the decoded audio signal. The spectrogram, which is the visual representation of the spectrum of frequencies of an audio signal, is adopted as the input of the feature network to extract the universal features introduced by steganography schemes; the deep neural network Spec-ResNet is designed to represent the steganalysis features; and the features extracted from different spectrogram windows are combined to fully capture the steganalysis features. The experimental results show that the proposed scheme has good detection accuracy and generality: it achieves better detection accuracy for three different AAC steganographic schemes and MP3Stego than state-of-the-art steganalysis schemes based on traditional hand-crafted or CNN-based features. To the best of our knowledge, this is the first audio steganalysis scheme based on the spectrogram and a deep residual network. The method proposed in this paper can be extended to audio steganalysis for other codecs or to audio forensics. Comment: 12 pages, 11 figures, 5 tables
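
    A hedged front-end sketch follows: computing log-spectrograms of a decoded signal at several window sizes (echoing the multi-window combination) and defining a tiny residual block with an identity shortcut. The window lengths and the block are illustrative stand-ins for the actual Spec-ResNet design.

        import numpy as np
        import torch
        import torch.nn as nn

        def log_spectrogram(signal, win, hop):
            # Frame the signal, window it, and take the magnitude FFT per frame.
            frames = np.lib.stride_tricks.sliding_window_view(signal, win)[::hop]
            spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
            return np.log1p(spec).T            # (freq_bins, time_frames)

        class ResidualBlock(nn.Module):
            """Identity-shortcut residual unit, the building block of a ResNet."""
            def __init__(self, ch=16):
                super().__init__()
                self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
                self.c2 = nn.Conv2d(ch, ch, 3, padding=1)
                self.act = nn.ReLU(inplace=True)

            def forward(self, x):
                return self.act(x + self.c2(self.act(self.c1(x))))

        audio = np.random.randn(16384)         # stand-in decoded audio signal
        specs = [log_spectrogram(audio, w, w // 2) for w in (256, 512, 1024)]
        print([s.shape for s in specs])        # one classifier input per window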

    Improvements of Motion Estimation and Coding using Neural Networks

    Inter-prediction is used effectively in multiple standards, including H.264 and HEVC (also known as H.265). It leverages correlation between blocks of consecutive video frames to perform motion compensation and thus predict block pixel values and reduce transmission bandwidth. To reduce the magnitude of the transmitted Motion Vector (MV), and thus the bandwidth, the encoder utilizes a Predicted Motion Vector (PMV), derived by taking the median vector of the corresponding MVs of the neighboring blocks. In this research, we propose innovative methods, based on neural network prediction, for improving the accuracy of the calculated PMV. We begin by showing a straightforward approach that calculates the best matching PMV and signals its neighbor-block index value to the decoder, reducing the number of bits required to represent the result without adding any computational complexity. Then we use a classification Fully Connected Neural Network (FCNN) to estimate the PMV from neighbors without requiring signaling, and show the advantage of the approach when employed for high-motion movies. We demonstrate the advantages using fast-forward movies; however, the same improvements apply to camera streams of autonomous vehicles, drone cameras, Pan-Tilt-Zoom (PTZ) cameras, and similar applications where the MV magnitudes are expected to be large. We also introduce a regression FCNN to predict the PMV. We calculate Huffman-coded streams and demonstrate a reduction of approximately 34% in the number of bits required to transmit the best matching calculated PMV without reducing quality, for fast-forward movies with high motion. Comment: 11 pages, 9 figures, Submitted to IEEE Transactions on Circuits and Systems for Video Technology
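
    The baseline the paper builds on can be sketched directly: the PMV as the component-wise median of the left, top, and top-right neighbor MVs (as in H.264), so that only the MV difference needs to be coded. The proposed FCNNs then learn to improve on this median predictor.

        import numpy as np

        def predicted_mv(mv_left, mv_top, mv_topright):
            neighbors = np.array([mv_left, mv_top, mv_topright])
            return np.median(neighbors, axis=0)   # per-component median

        pmv = predicted_mv([4, -2], [6, 0], [5, -1])
        mv = np.array([7, -1])                    # actual best-match MV
        mvd = mv - pmv                            # only this difference is coded
        print(pmv, mvd)                           # -> [ 5. -1.] [2. 0.]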

    A Comprehensive Survey on Cross-modal Retrieval

    In recent years, cross-modal retrieval has drawn much attention due to the rapid growth of multimodal data. It takes one type of data as the query to retrieve relevant data of another type; for example, a user can use a text query to retrieve relevant pictures or videos. Since the query and its retrieved results can be of different modalities, how to measure the content similarity between different modalities of data remains a challenge. Various methods have been proposed to deal with this problem. In this paper, we first review a number of representative methods for cross-modal retrieval and classify them into two main groups: 1) real-valued representation learning, and 2) binary representation learning. Real-valued representation learning methods aim to learn real-valued common representations for different modalities of data. To speed up cross-modal retrieval, a number of binary representation learning methods have been proposed to map different modalities of data into a common Hamming space. We then introduce several multimodal datasets in the community and show experimental results on two commonly used multimodal datasets. The comparison reveals the characteristics of the different kinds of cross-modal retrieval methods, which is expected to benefit both practical applications and future research. Finally, we discuss open problems and future research directions. Comment: 20 pages, 11 figures, 9 tables
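
    Retrieval in a common Hamming space can be sketched in a few lines: once a text query and an image database are mapped to binary codes by any binary representation learning method, retrieval reduces to Hamming-distance ranking. The random codes below stand in for learned hash codes.

        import numpy as np

        def hamming_rank(query_code, db_codes):
            # Count differing bits (equivalent to XOR + popcount) per database item.
            dists = np.count_nonzero(db_codes != query_code, axis=1)
            return np.argsort(dists, kind="stable")

        rng = np.random.default_rng(1)
        db = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)  # image codes
        q = rng.integers(0, 2, size=64, dtype=np.uint8)           # text query code
        print(hamming_rank(q, db)[:5])   # indices of the 5 nearest cross-modal items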