126 research outputs found

    Two-Stage Overfitting of Neural Network-Based Video Coding In-Loop Filter

    Modern video coding standards such as Versatile Video Coding (VVC) produce compression artefacts because of their block-based, lossy compression techniques. These artefacts are mitigated to an extent by the in-loop filters inside the coding process. Neural Network (NN) based in-loop filters are being explored for this denoising task, and in recent studies these NN-based loop filters are overfitted on the test content to make them content-adaptive and further enhance the visual quality of the video frames while balancing the trade-off between quality and bitrate. The loop filter is a relatively low-complexity Convolutional Neural Network (CNN) that is pretrained on a general video dataset and then fine-tuned on the video that needs to be encoded. Only a set of parameters inside the CNN architecture, named multipliers, is fine-tuned, so the bitrate overhead signalled to the decoder is minimized. The resulting weight update is compressed using the Neural Network Compression and Representation (NNR) standard. In this project, an exploration of high-performing hyperparameters was conducted, and a two-stage training process was employed to potentially further increase the coding efficiency of the in-loop filter. A first-stage model was overfitted on the test video sequence and used to identify the patches of the dataset on which it could improve the quality of the unfiltered video data; the second-stage model was then overfitted only on the patches that provided a gain. The model with the best-found hyperparameters saved on average 1.01% (Y), 4.28% (Cb), and 3.61% (Cr) Bjontegaard Delta rate (BD-rate) compared to the Versatile Video Coding (VVC) Test Model (VTM) 11.0 NN-based Video Coding (NNVC) 5.0 under the Random Access (RA) Common Test Conditions (CTC). 
Although the second-stage model exceeded the VTM, it underperformed the first-stage model by about 0.20% (Y), 0.23% (Cb), and 0.18% (Cr) BD-rate, due to the high bitrate overhead created by the second-stage model
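
The patch-selection step between the two training stages can be illustrated with a minimal sketch. It assumes that "gain" means a lower mean squared error against the original than the unfiltered reconstruction achieves; the abstract does not state the actual criterion, and the function names and toy patch values below are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized patches (flat lists of samples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_gain_patches(original, unfiltered, filtered):
    """Return the indices of patches where the first-stage filter output is
    strictly closer to the original than the unfiltered reconstruction.
    The MSE criterion is an assumption made for illustration."""
    keep = []
    for idx, (ref, rec, flt) in enumerate(zip(original, unfiltered, filtered)):
        if mse(flt, ref) < mse(rec, ref):
            keep.append(idx)
    return keep


# Toy data: three 4-sample patches (hypothetical values).
original = [[10, 10, 10, 10], [50, 52, 54, 56], [0, 0, 0, 0]]
unfiltered = [[12, 9, 11, 8], [50, 52, 54, 56], [1, 1, 1, 1]]
filtered = [[10, 10, 11, 9], [51, 52, 54, 56], [1, 1, 1, 1]]

# Only patch 0 is improved by the filter, so only it would be passed to stage two.
stage_two_indices = select_gain_patches(original, unfiltered, filtered)
print(stage_two_indices)
```

Patches where the filter ties with or degrades the unfiltered data are excluded, so a second-stage model in this sketch would see only content on which filtering helps.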

    Learning-based Wavelet-like Transforms For Fully Scalable and Accessible Image Compression

    The goal of this thesis is to improve the existing wavelet transform with the aid of machine learning techniques, so as to enhance the coding efficiency of wavelet-based image compression frameworks such as JPEG 2000. In this thesis, we first propose to augment the conventional base wavelet transform with two additional learned lifting steps: a high-to-low step followed by a low-to-high step. The high-to-low step suppresses aliasing in the low-pass band by using the detail bands at the same resolution, while the low-to-high step aims to further remove redundancy from the detail bands by using the corresponding low-pass band. These two additional steps reduce redundancy (notably aliasing information) amongst the wavelet subbands, and also improve the visual quality of reconstructed images at reduced resolutions. To train these two networks in an end-to-end fashion, we develop a backward annealing approach to overcome the non-differentiability of the quantization and cost functions during back-propagation. Importantly, the two additional networks share a common architecture, named a proposal-opacity topology, which is inspired and guided by a specific theoretical argument related to geometric flow. This network topology is compact, with limited non-linearities, allowing a fully scalable system: one pair of trained network parameters is applied at all levels of decomposition and at all bit-rates of interest. By employing the additional lifting networks within the JPEG 2000 image coding standard, we can achieve up to 17.4% average BD bit-rate saving over a wide range of bit-rates, while retaining the quality and resolution scalability features of JPEG 2000. Building upon the success of the high-to-low and low-to-high steps, we then study more broadly the extension of neural networks to all lifting steps that correspond to the base wavelet transform. 
The purpose of this comprehensive study is to understand the most effective way to develop learned wavelet-like transforms for highly scalable and accessible image compression. Specifically, we examine the impact of the number of learned lifting steps, the number of layers and the number of channels in each learned lifting network, and the kernel support in each layer. To facilitate the study, we develop a generic training methodology that is simultaneously appropriate to all lifting structures considered. Experimental results ultimately suggest that to improve the existing wavelet transform, it is more profitable to augment a larger wavelet transform with more diverse high-to-low and low-to-high steps, rather than developing deep fully learned lifting structures
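
A short sketch may help show why lifting steps are a natural place to insert learned operators: a lifting step stays exactly invertible whatever function computes the prediction or update. The code below is not the thesis's network. It uses fixed integer operators in the style of the reversible 5/3 transform, plus an arbitrary nonlinear function as a stand-in for a learned one; all names are hypothetical.

```python
def lifting_forward(signal, predict, update):
    """One lifting stage: split into even/odd samples, predict, then update."""
    assert len(signal) % 2 == 0, "this sketch assumes an even-length signal"
    even, odd = signal[0::2], signal[1::2]
    detail = [o - p for o, p in zip(odd, predict(even))]      # high-pass band
    approx = [e + u for e, u in zip(even, update(detail))]    # low-pass band
    return approx, detail


def lifting_inverse(approx, detail, predict, update):
    """Undo the stage by reversing the steps in the opposite order."""
    even = [a - u for a, u in zip(approx, update(detail))]
    odd = [d + p for d, p in zip(detail, predict(even))]
    signal = [0] * (len(even) + len(odd))
    signal[0::2] = even
    signal[1::2] = odd
    return signal


def predict_53(even):
    """Predict each odd sample from its two even neighbours (last one repeated)."""
    last = len(even) - 1
    return [(even[i] + even[min(i + 1, last)]) // 2 for i in range(len(even))]


def update_53(detail):
    """Update each even sample from its two neighbouring detail samples."""
    return [(detail[max(i - 1, 0)] + detail[i] + 2) // 4 for i in range(len(detail))]


def arbitrary_operator(samples):
    """Stand-in for a learned operator: any deterministic function works."""
    return [(abs(s) * 7) % 5 for s in samples]


signal = [3, 7, 1, 8, 2, 9, 4, 6]
approx, detail = lifting_forward(signal, predict_53, update_53)
restored = lifting_inverse(approx, detail, predict_53, update_53)

# Invertibility does not depend on the operator being linear or well designed.
a2, d2 = lifting_forward(signal, arbitrary_operator, arbitrary_operator)
restored_arbitrary = lifting_inverse(a2, d2, arbitrary_operator, arbitrary_operator)
print(restored == signal, restored_arbitrary == signal)
```

Integer arithmetic is used so that reconstruction is exact; with floating-point operators the same structure holds up to rounding error.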

    Situating Data: Inquiries in Algorithmic Culture

    Taking up the challenges of the datafication of culture, as well as of the scholarship of cultural inquiry itself, this collection contributes to the critical debate about data and algorithms. How can we understand the quality and significance of current socio-technical transformations that result from datafication and algorithmization? How can we explore the changing conditions and contours for living within such new and changing frameworks? How can, or should we, think and act within, but also in response to these conditions? This collection brings together various perspectives on the datafication and algorithmization of culture from debates and disciplines within the field of cultural inquiry, specifically (new) media studies, game studies, urban studies, screen studies, and gender and postcolonial studies. It proposes conceptual and methodological directions for exploring where, when, and how data and algorithms (re)shape cultural practices, create (in)justice, and (co)produce knowledge

    From Capture to Display: A Survey on Volumetric Video

    Volumetric video, which offers immersive viewing experiences, is gaining increasing prominence. With its six degrees of freedom, it provides viewers with greater immersion and interactivity than traditional video. Despite their potential, volumetric video services pose significant challenges. This survey conducts a comprehensive review of the existing literature on volumetric video. We first provide a general framework of volumetric video services, followed by a discussion of the prerequisites for volumetric video, encompassing representations, open datasets, and quality assessment metrics. We then delve into the current methodologies for each stage of the volumetric video service pipeline, detailing capturing, compression, transmission, rendering, and display techniques. Lastly, we explore various applications enabled by this pioneering technology and present an array of research challenges and opportunities in the domain of volumetric video services. This survey aspires to provide a holistic understanding of this burgeoning field and shed light on potential future research trajectories, aiming to bring the vision of volumetric video to fruition

    Beyond Transmitting Bits: Context, Semantics, and Task-Oriented Communications

    Communication systems to date primarily aim at reliably communicating bit sequences. Such an approach provides efficient engineering designs that are agnostic to the meanings of the messages or to the goal that the message exchange aims to achieve. Next generation systems, however, can be potentially enriched by folding message semantics and goals of communication into their design. Further, these systems can be made cognizant of the context in which communication exchange takes place, thereby providing avenues for novel design insights. This tutorial summarizes the efforts to date, starting from its early adaptations, semantic-aware and task-oriented communications, covering the foundations, algorithms and potential implementations. The focus is on approaches that utilize information theory to provide the foundations, as well as the significant role of learning in semantics and task-aware communications

    Network and Content Intelligence for 360 Degree Video Streaming Optimization

    In recent years, 360° videos, a.k.a. spherical frames, have become popular among users because they create an immersive streaming experience. Along with advances in smartphone and Head Mounted Device (HMD) technology, many content providers have begun to host and stream 360° videos in both on-demand and live streaming modes. As a result, many applications have arisen that leverage these immersive videos, especially to give viewers an impression of presence in a digital environment. For example, with 360° videos it is now possible to connect people in a remote meeting in an interactive way, which increases the productivity of the meeting. Likewise, creating interactive learning materials with 360° videos helps deliver learning outcomes to students effectively. However, streaming 360° videos is not an easy task, for several reasons. First, 360° video frames are 4–6 times larger than normal video frames at the same quality. Therefore, delivering these videos demands higher network bandwidth. Second, processing these larger frames requires more computational resources at the end devices, which is a particular burden for end-user devices with limited resources. This affects not only the delivery of 360° videos but also other applications running on shared resources. Third, these videos need to be streamed with very low latency due to their interactive nature. Failure to satisfy these requirements can result in poor Quality of Experience (QoE) for the user. For example, insufficient bandwidth incurs frequent rebuffering and poor video quality. Inadequate computational capacity can cause faster battery drain and unnecessary heating of the device, causing discomfort to the user. Motion sickness or cybersickness will be prevalent if there is unnecessary delay in streaming. 
These circumstances hinder the delivery of immersive streaming experiences to the communities that need them most, especially those without sufficient network resources. To address these challenges, we believe that enhancements to the three main components of the video streaming pipeline (server, network, and client) are essential. Starting with the network, it is beneficial for network providers to identify 360° video flows as early as possible and to understand their behaviour in the network, so that sufficient resources can be allocated to this video delivery without compromising the quality of other services. Content servers, at one end of the streaming pipeline, require efficient 360° video frame processing mechanisms to support adaptive video streaming mechanisms such as Adaptive Bit Rate (ABR) based streaming and viewport (VP) aware streaming, a streaming paradigm unique to 360° videos that selects only the part of the larger video frame that falls within the user-visible region. At the other end, the client can be combined with edge-assisted streaming to deliver 360° video content with reduced latency and higher quality. Following the above optimization strategies, in this thesis we first propose a mechanism named 360NorVic to extract 360° video flows from encrypted video traffic and analyze their traffic characteristics. We propose Machine Learning (ML) models to classify 360° and normal videos under different scenarios such as offline, near real-time, VP-aware streaming, and Mobile Network Operator (MNO) level streaming. Having extracted 360° video traffic traces from both packet-level and flow-level data with high accuracy, we analyze the differences between 360° and normal video patterns in the encrypted traffic domain, which is beneficial for effective resource optimization to enhance 360° video delivery. 
Second, we present a WGAN (Wasserstein Generative Adversarial Network) based data generation mechanism, VideoTrain++, to synthesize encrypted network video traffic from minimal data. Leveraging the synthetic data, we show improved performance in 360° video traffic analysis, especially in the ML-based classification in 360NorVic. Third, we propose an effective server-side 360° video frame partitioning mechanism, VASTile, to support VP-aware 360° video streaming with dynamic (or variable) tiles of different sizes and locations on the frame. VASTile takes a visual attention map of the video frames as input and applies a computational geometric approach to generate a non-overlapping tile configuration that covers the video frames adaptively to the visual attention. We present VASTile as a scalable approach for video frame processing at the servers and a method to reduce bandwidth consumption in network data transmission. Finally, by applying VASTile to the individual user VP at the client side and utilizing the cache storage of Multi Access Edge Computing (MEC) servers, we propose OpCASH, a mechanism to personalize 360° video streaming with dynamic tiles and edge assistance. Using an ILP-based solution to select cached variable tiles from MEC servers that might not be identical to the VP tiles requested by the user but still effectively cover the same VP region, OpCASH maximizes cache utilization and reduces the number of requests to the content servers in a congested core network. With this approach, we demonstrate gains in latency, bandwidth saving, and video quality in personalized 360° video streaming

    Images on the Move

    In contemporary society, digital images have become increasingly mobile. They are networked, shared on social media, and circulated across small and portable screens. Accordingly, the discourses of spreadability and circulation have come to supersede the focus on production, indexicality, and manipulability, which had dominated early conceptions of digital photography and film. However, the mobility of images is neither technologically nor conceptually limited to the realm of the digital. The edited volume re-examines the historical, aesthetical, and theoretical relevance of image mobility. The contributors provide a materialist account of images on the move - ranging from wired photography to postcards to streaming media

    Visual Content Characterization Based on Encoding Rate-Distortion Analysis

    Visual content characterization is a fundamentally important but under-exploited step in dataset construction, which is essential in solving many image processing and computer vision problems. In the era of machine learning, this has become ever more important: with the explosion of image and video content, scrutinizing all potential content is impossible and source content selection has become increasingly difficult. In particular, in the area of image/video coding and quality assessment, it is highly desirable to characterize and select source content and subsequently construct image/video datasets that demonstrate strong representativeness and diversity of the visual world, such that the visual coding and quality assessment methods developed from and validated using such datasets exhibit strong generalizability. Encoding Rate-Distortion (RD) analysis is essential for many multimedia applications. Examples of applications that explicitly use RD analysis include image encoder RD optimization, video quality assessment (VQA), and Quality of Experience (QoE) optimization of streaming videos. However, encoding RD analysis has not been well investigated in the context of visual content characterization. This thesis focuses on applying encoding RD analysis as a visual source content characterization method with image/video coding and quality assessment applications in mind. We first conduct a subjective video quality evaluation experiment for state-of-the-art video encoder performance analysis and comparison, where our observations reveal severe problems that motivate the need for better source content characterization and selection methods. Then the effectiveness of RD analysis in visual source content characterization is demonstrated through a proposed quality control mechanism for video coding by eigen analysis in the space of General Quality Parameter (GQP) functions. 
Finally, by combining encoding RD analysis with submodular set function optimization, we propose a novel method for automating the process of representative source content selection, which helps boost the RD performance of visual encoders trained with the selected visual content
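
To illustrate what submodular selection of representative content can look like, here is a minimal greedy sketch. The abstract does not specify the set function, so this assumes a facility-location objective over per-item feature vectors (for instance, points sampled from an RD curve); the similarity measure, function names, and toy features are all hypothetical.

```python
import math


def similarity(a, b):
    """Similarity in (0, 1]; larger means the two feature vectors are closer."""
    return 1.0 / (1.0 + math.dist(a, b))


def coverage(selected, features):
    """Facility-location value: how well the selected items represent every item."""
    if not selected:
        return 0.0
    return sum(max(similarity(f, features[j]) for j in selected) for f in features)


def greedy_select(features, k):
    """Greedily add the item with the largest coverage gain until k are chosen."""
    selected = []
    for _ in range(min(k, len(features))):
        remaining = [i for i in range(len(features)) if i not in selected]
        best = max(remaining, key=lambda i: coverage(selected + [i], features))
        selected.append(best)
    return selected


# Toy features forming two tight clusters (hypothetical values).
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
chosen = greedy_select(features, 2)
print(chosen)
```

With two items to pick, the greedy rule takes one item from each cluster, since a second item from an already covered cluster adds little coverage. For monotone submodular objectives such as this one, greedy selection carries the well-known (1 - 1/e) approximation guarantee.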

    Deep learning based objective quality assessment of multidimensional visual content

    Thesis (doctorate), Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Elétrica, 2022. Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). In the last decade, there has been a tremendous increase in the popularity of multimedia applications, and hence in the volume of multimedia content. When this content is generated, transmitted, reconstructed, and shared, its original pixel values are transformed. In this scenario, it becomes more crucial and demanding to assess the visual quality of the affected content so that the requirements of end users are satisfied. In this work, we investigate effective spatial, temporal, and angular features by developing no-reference algorithms that assess the visual quality of distorted multi-dimensional visual content. We use machine learning and deep learning algorithms to obtain accurate predictions. For two-dimensional (2D) image quality assessment, we use multiscale local binary patterns and saliency information, and train and test these features using a Random Forest Regressor. For 2D video quality assessment, we introduce a novel concept of spatial and temporal saliency and custom objective quality scores. We use a lightweight Convolutional Neural Network (CNN) based model for training and testing on selected patches of video frames. For objective quality assessment of four-dimensional (4D) light field images (LFI), we propose seven LFI quality assessment (LF-IQA) methods in total. 
Considering that an LFI is composed of dense multi-views, and inspired by the Human Visual System (HVS), we propose our first LF-IQA method, which is based on a two-stream CNN architecture. The second and third LF-IQA methods are also based on a two-stream architecture, which incorporates CNN, Long Short-Term Memory (LSTM), and diverse bottleneck features. The fourth LF-IQA method is based on CNN and Atrous Convolution layers (ACL), while the fifth method uses CNN, ACL, and LSTM layers. The sixth LF-IQA method is also based on a two-stream architecture, in which horizontal and vertical epipolar plane images (EPIs) are processed in the frequency domain. Last, but not least, the seventh LF-IQA method is based on a Graph Convolutional Neural Network. For all of the methods mentioned above, we performed intensive experiments, and the results show that these methods outperformed state-of-the-art methods on popular quality datasets

    Applications in Electronics Pervading Industry, Environment and Society

    This book features the manuscripts accepted for the Special Issue “Applications in Electronics Pervading Industry, Environment and Society—Sensing Systems and Pervasive Intelligence” of the MDPI journal Sensors. Most of the papers come from a selection of the best papers of the 2019 edition of the “Applications in Electronics Pervading Industry, Environment and Society” (APPLEPIES) Conference, which was held in November 2019. All these papers have been significantly enhanced with novel experimental results. The papers give an overview of the trends in research and development activities concerning the pervasive application of electronics in industry, the environment, and society. The focus of these papers is on cyber physical systems (CPS), with research proposals for new sensor acquisition and ADC (analog to digital converter) methods, high-speed communication systems, cybersecurity, big data management, and data processing including emerging machine learning techniques. Physical implementation aspects are discussed as well as the trade-off found between functional performance and hardware/system costs