Metaverse: A Young Gamer's Perspective
When developing technologies for the Metaverse, it is important to understand
the needs and requirements of end users. Relatively little is known about the
specific perspectives on the use of the Metaverse by the youngest audience:
children ten and under. This paper explores the Metaverse from the perspective
of a young gamer. It examines their understanding of the Metaverse in relation
to the physical world and other technologies they may be familiar with, looks
at some of their expectations of the Metaverse, and then relates these to the
specific multimedia signal processing (MMSP) research challenges. The
perspectives presented in the paper may be useful for planning more detailed
subjective experiments involving young gamers, as well as informing the
research on MMSP technologies targeted at these users.
Comment: 6 pages, 5 figures, IEEE MMSP 202
LCCM-VC: Learned Conditional Coding Modes for Video Compression
End-to-end learning-based video compression has made steady progress over the
last several years. However, unlike learning-based image coding, which has
already surpassed its handcrafted counterparts, learning-based video coding
still has some ways to go. In this paper we present learned conditional coding
modes for video coding (LCCM-VC), a video coding model that achieves
state-of-the-art results among learning-based video coding methods. Our model
utilizes conditional coding engines from the recent conditional augmented
normalizing flows (CANF) pipeline, and introduces additional coding modes to
improve compression performance. The compression efficiency is especially good
in the high-quality/high-bitrate range, which is important for broadcast and
video-on-demand streaming applications. The implementation of LCCM-VC is
available at https://github.com/hadihdz/lccm_vc
Comment: 5 pages, 3 figures, IEEE ICASSP 202
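The idea of choosing among coding modes per block, which LCCM-VC builds on, can be illustrated with a toy two-mode coder (a generic sketch, not the LCCM-VC model or its CANF-based engines): for each block, the encoder compares a "skip" mode that copies the temporal prediction against a "residual" mode that quantizes and transmits the prediction error, picking whichever has the lower rate-distortion cost.

```python
import numpy as np

def encode_block(x, pred, step=4.0):
    """Toy two-mode coder: 'skip' copies the prediction at near-zero rate;
    'residual' transmits the quantized prediction error."""
    symbols = np.round((x - pred) / step).astype(int)
    # Rate proxy: each mode pays 1 bit for the mode flag; 'residual'
    # additionally pays roughly the magnitude of its symbols.
    skip_cost, res_cost = 1.0, 1.0 + np.abs(symbols).sum()
    skip_dist = np.square(x - pred).sum()
    res_dist = np.square(x - (pred + symbols * step)).sum()
    lam = 0.1  # Lagrangian trade-off between rate and distortion
    if skip_dist + lam * skip_cost <= res_dist + lam * res_cost:
        return "skip", None
    return "residual", symbols

def decode_block(mode, payload, pred, step=4.0):
    if mode == "skip":
        return pred.copy()
    return pred + payload * step

rng = np.random.default_rng(0)
pred = rng.normal(size=(8, 8))                            # temporal prediction
x_static = pred + rng.normal(scale=0.01, size=(8, 8))     # well-predicted block
x_moving = pred + rng.normal(scale=5.0, size=(8, 8))      # poorly predicted block

for x in (x_static, x_moving):
    mode, payload = encode_block(x, pred)
    rec = decode_block(mode, payload, pred)
    print(mode, float(np.square(x - rec).mean()))
```

Real codecs make this choice with entropy-coded rates and learned (or transform-domain) representations, but the mode decision itself is the same rate-distortion comparison.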
Tensor Completion Methods for Collaborative Intelligence
In the race to bring Artificial Intelligence (AI) to the edge, collaborative intelligence has emerged as a promising way to lighten the computation load on edge devices that run applications based on Deep Neural Networks (DNNs). Typically, a deep model is split at a given layer into edge and cloud sub-models. The deep feature tensor produced by the edge sub-model is transmitted to the cloud, where the remaining computationally intensive workload is performed by the cloud sub-model. The communication channel between the edge and the cloud is imperfect, which results in missing data in the deep feature tensor received at the cloud side, an issue that has mostly been ignored in the existing literature on the topic. In this paper, we study four methods for recovering missing data in the deep feature tensor. Three of the studied methods are existing, generic tensor completion methods, adapted here to recover deep feature tensor data, while the fourth is newly developed specifically for deep feature tensor completion. Simulation studies show that the new method is 3–18 times faster than the other three, an important consideration in collaborative intelligence. For VGG16's sparse tensors, all methods produce statistically equivalent classification results across all loss levels tested. For ResNet34's non-sparse tensors, the new method offers statistically better classification accuracy (by 0.25%–6.30%) than the other methods at matched execution speeds, and the second-best accuracy among the four when they are allowed to run until convergence.
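The generic tensor completion setting can be sketched with a simple low-rank imputation scheme (an illustrative baseline, not one of the four methods studied in the paper): entries lost in transmission are filled iteratively with a rank-truncated SVD approximation of the current estimate, while observed entries are kept fixed. For a deep feature tensor, one would first flatten it into a matrix; the sketch below works directly on a synthetic low-rank matrix.

```python
import numpy as np

def lowrank_complete(obs, mask, rank=2, iters=200):
    """Recover missing entries of an (approximately) low-rank matrix.
    obs: observed values (arbitrary where mask is False);
    mask: True at observed entries.
    Iteratively replaces the missing entries with the current
    rank-`rank` SVD approximation (hard-impute)."""
    x = np.where(mask, obs, 0.0)
    for _ in range(iters):
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        x = np.where(mask, obs, approx)  # keep observed, update missing
    return x

rng = np.random.default_rng(1)
# Ground-truth rank-2 matrix standing in for a flattened feature tensor
truth = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))
mask = rng.random(truth.shape) > 0.3     # ~30% of entries lost in transit
recovered = lowrank_complete(truth, mask, rank=2)
err = np.linalg.norm(recovered - truth) / np.linalg.norm(truth)
print(f"relative error: {err:.4f}")
```

Methods of this kind need many SVD iterations, which is exactly why execution speed matters for collaborative intelligence, where completion sits on the latency-critical edge-to-cloud path.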
Learned Scalable Video Coding For Humans and Machines
Video coding has traditionally been developed to support services such as
video streaming, videoconferencing, digital TV, and so on. The main intent was
to enable human viewing of the encoded content. However, with the advances in
deep neural networks (DNNs), encoded video is increasingly being used for
automatic video analytics performed by machines. In applications such as
automatic traffic monitoring, analytics such as vehicle detection, tracking,
and counting would run continuously, while human viewing could be required
occasionally to review potential incidents. To support such applications, a new
paradigm for video coding is needed that will facilitate efficient
representation and compression of video for both machine and human use in a
scalable manner. In this manuscript, we introduce the first end-to-end
learnable video codec that supports a machine vision task in its base layer,
while its enhancement layer supports input reconstruction for human viewing.
The proposed system is constructed based on the concept of conditional coding
to achieve better compression gains. Comprehensive experimental evaluations
conducted on four standard video datasets demonstrate that our framework
outperforms both state-of-the-art learned and conventional video codecs in its
base layer, while maintaining comparable performance on the human vision task
in its enhancement layer. We will provide the implementation of the proposed
system at www.github.com upon completion of the review process.
Comment: 14 pages, 16 figures
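The base/enhancement structure common to scalable coding can be illustrated with a minimal two-layer quantizer (a generic sketch of layered scalability, not the learned codec described above): the base layer carries a coarse quantization of the signal, and the enhancement layer refines the base-layer reconstruction error, so decoding both layers yields a finer reconstruction than decoding the base alone.

```python
import numpy as np

def encode_scalable(x, base_step=1.0, enh_step=0.25):
    """Two-layer scalable quantization: base = coarse quantization of x;
    enhancement = quantization of the base-layer residual."""
    base = np.round(x / base_step).astype(int)
    base_rec = base * base_step
    enh = np.round((x - base_rec) / enh_step).astype(int)
    return base, enh

def decode_base(base, base_step=1.0):
    return base * base_step

def decode_full(base, enh, base_step=1.0, enh_step=0.25):
    return base * base_step + enh * enh_step

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
base, enh = encode_scalable(x)
mse_base = np.square(x - decode_base(base)).mean()
mse_full = np.square(x - decode_full(base, enh)).mean()
print(mse_base, mse_full)
```

In the human/machine setting, the base layer would instead carry a task-oriented representation (e.g., features sufficient for detection) and the enhancement layer the extra information needed for full pixel reconstruction, but the layered decode pattern is the same.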
Scalable Video Coding for Humans and Machines
Video content is watched not only by humans, but increasingly also by
machines. For example, machine learning models analyze surveillance video for
security and traffic monitoring, search through YouTube videos for
inappropriate content, and so on. In this paper, we propose a scalable video
coding framework that supports machine vision (specifically, object detection)
through its base layer bitstream and human vision via its enhancement layer
bitstream. The proposed framework includes components from both conventional
and Deep Neural Network (DNN)-based video coding. The results show that on
object detection, the proposed framework achieves 13-19% bit savings compared
to state-of-the-art video codecs, while remaining competitive in terms of
MS-SSIM on the human vision task.
Comment: 6 pages, 5 figures, IEEE MMSP 202