15 research outputs found

    Challenges and solutions in H.265/HEVC for integrating consumer electronics in professional video systems

    Get PDF

    Multi-frame reconstruction using super-resolution, inpainting, segmentation and codecs

    Get PDF
    In this thesis, different aspects of video and light field reconstruction are considered such as super-resolution, inpainting, segmentation and codecs. For this purpose, each of these strategies are analyzed based on a specific goal and a specific database. Accordingly, databases which are relevant to film industry, sport videos, light fields and hyperspectral videos are used for the sake of improvement. This thesis is constructed around six related manuscripts, in which several approaches are proposed for multi-frame reconstruction. Initially, a novel multi-frame reconstruction strategy is proposed for lightfield super-resolution in which graph-based regularization is applied along with edge preserving filtering for improving the spatio-angular quality of lightfield. Second, a novel video reconstruction is proposed which is built based on compressive sensing (CS), Gaussian mixture models (GMM) and sparse 3D transform-domain block matching. The motivation of the proposed technique is the improvement in visual quality performance of the video frames and decreasing the reconstruction error in comparison with the former video reconstruction methods. In the next approach, student-t mixture models and edge preserving filtering are applied for the purpose of video super-resolution. Student-t mixture model has a heavy tail which makes it robust and suitable as a video frame patch prior and rich in terms of log likelihood for information retrieval. In another approach, a hyperspectral video database is considered, and a Bayesian dictionary learning process is used for hyperspectral video super-resolution. To that end, Beta process is used in Bayesian dictionary learning and a sparse coding is generated regarding the hyperspectral video super-resolution. The spatial super-resolution is followed by a spectral video restoration strategy, and the whole process leveraged two different dictionary learnings, in which the first one is trained for spatial super-resolution and the second one is trained for the spectral restoration. Furthermore, in another approach, a novel framework is proposed for replacing advertisement contents in soccer videos in an automatic way by using deep learning strategies. For this purpose, a UNET architecture is applied (an image segmentation convolutional neural network technique) for content segmentation and detection. Subsequently, after reconstructing the segmented content in the video frames (considering the apparent loss in detection), the unwanted content is replaced by new one using a homography mapping procedure. In addition, in another research work, a novel video compression framework is presented using autoencoder networks that encode and decode videos by using less chroma information than luma information. For this purpose, instead of converting Y'CbCr 4:2:2/4:2:0 videos to and from RGB 4:4:4, the video is kept in Y'CbCr 4:2:2/4:2:0 and merged the luma and chroma channels after the luma is downsampled to match the chroma size. An inverse function is performed for the decoder. The performance of these models is evaluated by using CPSNR, MS-SSIM, and VMAF metrics. The experiments reveal that, as compared to video compression involving conversion to and from RGB 4:4:4, the proposed method increases the video quality by about 5.5% for Y'CbCr 4:2:2 and 8.3% for Y'CbCr 4:2:0 while reducing the amount of computation by nearly 37% for Y'CbCr 4:2:2 and 40% for Y'CbCr 4:2:0. The thread that ties these approaches together is reconstruction of the video and light field frames based on different aspects of problems such as having loss of information, blur in the frames, existing noise after reconstruction, existing unpleasant content, excessive size of information and high computational overhead. In three of the proposed approaches, we have used Plug-and-Play ADMM model for the first time regarding reconstruction of videos and light fields in order to address both information retrieval in the frames and tackling noise/blur at the same time. In two of the proposed models, we applied sparse dictionary learning to reduce the data dimension and demonstrate them as an efficient linear combination of basis frame patches. Two of the proposed approaches are developed in collaboration with industry, in which deep learning frameworks are used to handle large set of features and to learn high-level features from the data

    Understanding perceived quality through visual representations

    Get PDF
    The formatting of images can be considered as an optimization problem, whose cost function is a quality assessment algorithm. There is a trade-off between bit budget per pixel and quality. To maximize the quality and minimize the bit budget, we need to measure the perceived quality. In this thesis, we focus on understanding perceived quality through visual representations that are based on visual system characteristics and color perception mechanisms. Specifically, we use the contrast sensitivity mechanisms in retinal ganglion cells and the suppression mechanisms in cortical neurons. We utilize color difference equations and color name distances to mimic pixel-wise color perception and a bio-inspired model to formulate center surround effects. Based on these formulations, we introduce two novel image quality estimators PerSIM and CSV, and a new image quality-assistance method BLeSS. We combine our findings from visual system and color perception with data-driven methods to generate visual representations and measure their quality. The majority of existing data-driven methods require subjective scores or degraded images. In contrast, we follow an unsupervised approach that only utilizes generic images. We introduce a novel unsupervised image quality estimator UNIQUE, and extend it with multiple models and layers to obtain MS-UNIQUE and DMS-UNIQUE. In addition to introducing quality estimators, we analyze the role of spatial pooling and boosting in image quality assessment.Ph.D

    Algorithms and Hardware Co-Design of HEVC Intra Encoders

    Get PDF
    Digital video is becoming extremely important nowadays and its importance has greatly increased in the last two decades. Due to the rapid development of information and communication technologies, the demand for Ultra-High Definition (UHD) video applications is becoming stronger. However, the most prevalent video compression standard H.264/AVC released in 2003 is inefficient when it comes to UHD videos. The increasing desire for superior compression efficiency to H.264/AVC leads to the standardization of High Efficiency Video Coding (HEVC). Compared with the H.264/AVC standard, HEVC offers a double compression ratio at the same level of video quality or substantial improvement of video quality at the same video bitrate. Yet, HE-VC/H.265 possesses superior compression efficiency, its complexity is several times more than H.264/AVC, impeding its high throughput implementation. Currently, most of the researchers have focused merely on algorithm level adaptations of HEVC/H.265 standard to reduce computational intensity without considering the hardware feasibility. What’s more, the exploration of efficient hardware architecture design is not exhaustive. Only a few research works have been conducted to explore efficient hardware architectures of HEVC/H.265 standard. In this dissertation, we investigate efficient algorithm adaptations and hardware architecture design of HEVC intra encoders. We also explore the deep learning approach in mode prediction. From the algorithm point of view, we propose three efficient hardware-oriented algorithm adaptations, including mode reduction, fast coding unit (CU) cost estimation, and group-based CABAC (context-adaptive binary arithmetic coding) rate estimation. Mode reduction aims to reduce mode candidates of each prediction unit (PU) in the rate-distortion optimization (RDO) process, which is both computation-intensive and time-consuming. Fast CU cost estimation is applied to reduce the complexity in rate-distortion (RD) calculation of each CU. Group-based CABAC rate estimation is proposed to parallelize syntax elements processing to greatly improve rate estimation throughput. From the hardware design perspective, a fully parallel hardware architecture of HEVC intra encoder is developed to sustain UHD video compression at 4K@30fps. The fully parallel architecture introduces four prediction engines (PE) and each PE performs the full cycle of mode prediction, transform, quantization, inverse quantization, inverse transform, reconstruction, rate-distortion estimation independently. PU blocks with different PU sizes will be processed by the different prediction engines (PE) simultaneously. Also, an efficient hardware implementation of a group-based CABAC rate estimator is incorporated into the proposed HEVC intra encoder for accurate and high-throughput rate estimation. To take advantage of the deep learning approach, we also propose a fully connected layer based neural network (FCLNN) mode preselection scheme to reduce the number of RDO modes of luma prediction blocks. All angular prediction modes are classified into 7 prediction groups. Each group contains 3-5 prediction modes that exhibit a similar prediction angle. A rough angle detection algorithm is designed to determine the prediction direction of the current block, then a small scale FCLNN is exploited to refine the mode prediction

    Minimizing Computational Resources for Deep Machine Learning: A Compression and Neural Architecture Search Perspective for Image Classification and Object Detection

    Get PDF
    Computational resources represent a significant bottleneck across all current deep learning computer vision approaches. Image and video data storage requirements for training deep neural networks have led to the widespread use of image and video compression, the use of which naturally impacts the performance of neural network architectures during both training and inference. The prevalence of deep neural networks deployed on edge devices necessitates efficient network architecture design, while training neural networks requires significant time and computational resources, despite the acceleration of both hardware and software developments within the field of artificial intelligence (AI). This thesis addresses these challenges in order to minimize computational resource requirements across the entire end-to-end deep learning pipeline. We determine the extent to which data compression impacts neural network architecture performance, and by how much this performance can be recovered by retraining neural networks with compressed data. The thesis then focuses on the accessibility of the deployment of neural architecture search (NAS) to facilitate automatic network architecture generation for image classification suited to resource-constrained environments. A combined hard example mining and curriculum learning strategy is developed to minimize the image data processed during a given training epoch within the NAS search phase, without diminishing performance. We demonstrate the capability of the proposed framework across all gradient-based, reinforcement learning, and evolutionary NAS approaches, and a simple but effective method to extend the approach to the prediction-based NAS paradigm. The hard example mining approach within the proposed NAS framework depends upon the effectiveness of an autoencoder to regulate the latent space such that similar images have similar feature embeddings. This thesis conducts a thorough investigation to satisfy this constraint within the context of image classification. Based upon the success of the overall proposed NAS framework, we subsequently extend the approach towards object detection. Despite the resultant multi-label domain presenting a more difficult challenge for hard example mining, we propose an extension to the autoencoder to capture the additional object location information encoded within the training labels. The generation of an implicit attention layer within the autoencoder network sufficiently improves its capability to enforce similar images to have similar embeddings, thus successfully transferring the proposed NAS approach to object detection. Finally, the thesis demonstrates the resilience to compression of the general two-stage NAS approach upon which our proposed NAS framework is based

    Adapting Computer Vision Models To Limitations On Input Dimensionality And Model Complexity

    Get PDF
    When considering instances of distributed systems where visual sensors communicate with remote predictive models, data traffic is limited to the capacity of communication channels, and hardware limits the processing of collected data prior to transmission. We study novel methods of adapting visual inference to limitations on complexity and data availability at test time, wherever the aforementioned limitations exist. Our contributions detailed in this thesis consider both task-specific and task-generic approaches to reducing the data requirement for inference, and evaluate our proposed methods on a wide range of computer vision tasks. This thesis makes four distinct contributions: (i) We investigate multi-class action classification via two-stream convolutional neural networks that directly ingest information extracted from compressed video bitstreams. We show that selective access to macroblock motion vector information provides a good low-dimensional approximation of the underlying optical flow in visual sequences. (ii) We devise a bitstream cropping method by which AVC/H.264 and H.265 bitstreams are reduced to the minimum amount of necessary elements for optical flow extraction, while maintaining compliance with codec standards. We additionally study the effect of codec rate-quality control on the sparsity and noise incurred on optical flow derived from resulting bitstreams, and do so for multiple coding standards. (iii) We demonstrate degrees of variability in the amount of data required for action classification, and leverage this to reduce the dimensionality of input volumes by inferring the required temporal extent for accurate classification prior to processing via learnable machines. (iv) We extend the Mixtures-of-Experts (MoE) paradigm to adapt the data cost of inference for any set of constituent experts. We postulate that the minimum acceptable data cost of inference varies for different input space partitions, and consider mixtures where each expert is designed to meet a different set of constraints on input dimensionality. To take advantage of the flexibility of such mixtures in processing different input representations and modalities, we train biased gating functions such that experts requiring less information to make their inferences are favoured to others. We finally note that, our proposed data utility optimization solutions include a learnable component which considers specified priorities on the amount of information to be used prior to inference, and can be realized for any combination of tasks, modalities, and constraints on available data

    Novi algoritam za kompresiju seizmičkih podataka velike amplitudske rezolucije

    Get PDF
    Renewable sources cannot meet energy demand of a growing global market. Therefore, it is expected that oil & gas will remain a substantial sources of energy in a coming years. To find a new oil & gas deposits that would satisfy growing global energy demands, significant efforts are constantly involved in finding ways to increase efficiency of a seismic surveys. It is commonly considered that, in an initial phase of exploration and production of a new fields, high-resolution and high-quality images of the subsurface are of the great importance. As one part in the seismic data processing chain, efficient managing and delivering of a large data sets, that are vastly produced by the industry during seismic surveys, becomes extremely important in order to facilitate further seismic data processing and interpretation. In this respect, efficiency to a large extent relies on the efficiency of the compression scheme, which is often required to enable faster transfer and access to data, as well as efficient data storage. Motivated by the superior performance of High Efficiency Video Coding (HEVC), and driven by the rapid growth in data volume produced by seismic surveys, this work explores a 32 bits per pixel (b/p) extension of the HEVC codec for compression of seismic data. It is proposed to reassemble seismic slices in a format that corresponds to video signal and benefit from the coding gain achieved by HEVC inter mode, besides the possible advantages of the (still image) HEVC intra mode. To this end, this work modifies almost all components of the original HEVC codec to cater for high bit-depth coding of seismic data: Lagrange multiplier used in optimization of the coding parameters has been adapted to the new data statistics, core transform and quantization have been reimplemented to handle the increased bit-depth range, and modified adaptive binary arithmetic coder has been employed for efficient entropy coding. In addition, optimized block selection, reduced intra prediction modes, and flexible motion estimation are tested to adapt to the structure of seismic data. Even though the new codec after implementation of the proposed modifications goes beyond the standardized HEVC, it still maintains a generic HEVC structure, and it is developed under the general HEVC framework. There is no similar work in the field of the seismic data compression that uses the HEVC as a base codec setting. Thus, a specific codec design has been tailored which, when compared to the JPEG-XR and commercial wavelet-based codec, significantly improves the peak-signal-tonoise- ratio (PSNR) vs. compression ratio performance for 32 b/p seismic data. Depending on a proposed configurations, PSNR gain goes from 3.39 dB up to 9.48 dB. Also, relying on the specific characteristics of seismic data, an optimized encoder is proposed in this work. It reduces encoding time by 67.17% for All-I configuration on trace image dataset, and 67.39% for All-I, 97.96% for P2-configuration and 98.64% for B-configuration on 3D wavefield dataset, with negligible coding performance losses. As a side contribution of this work, HEVC is analyzed within all of its functional units, so that the presented work itself can serve as a specific overview of methods incorporated into the standard

    Energy efficient enabling technologies for semantic video processing on mobile devices

    Get PDF
    Semantic object-based processing will play an increasingly important role in future multimedia systems due to the ubiquity of digital multimedia capture/playback technologies and increasing storage capacity. Although the object based paradigm has many undeniable benefits, numerous technical challenges remain before the applications becomes pervasive, particularly on computational constrained mobile devices. A fundamental issue is the ill-posed problem of semantic object segmentation. Furthermore, on battery powered mobile computing devices, the additional algorithmic complexity of semantic object based processing compared to conventional video processing is highly undesirable both from a real-time operation and battery life perspective. This thesis attempts to tackle these issues by firstly constraining the solution space and focusing on the human face as a primary semantic concept of use to users of mobile devices. A novel face detection algorithm is proposed, which from the outset was designed to be amenable to be offloaded from the host microprocessor to dedicated hardware, thereby providing real-time performance and reducing power consumption. The algorithm uses an Artificial Neural Network (ANN), whose topology and weights are evolved via a genetic algorithm (GA). The computational burden of the ANN evaluation is offloaded to a dedicated hardware accelerator, which is capable of processing any evolved network topology. Efficient arithmetic circuitry, which leverages modified Booth recoding, column compressors and carry save adders, is adopted throughout the design. To tackle the increased computational costs associated with object tracking or object based shape encoding, a novel energy efficient binary motion estimation architecture is proposed. Energy is reduced in the proposed motion estimation architecture by minimising the redundant operations inherent in the binary data. Both architectures are shown to compare favourable with the relevant prior art
    corecore