Streaming and User Behaviour in Omnidirectional Videos
Omnidirectional videos (ODVs) have gone beyond the passive paradigm of traditional video,
offering higher degrees of immersion and interaction. The revolutionary novelty of this technology is the possibility for users to interact with the surrounding environment, and to feel a
sense of engagement and presence in a virtual space. Users are clearly the main driving force of
immersive applications, and consequently services need to be properly tailored to them.
In this context, this chapter highlights the importance of the new role of users in ODV streaming applications, and thus the need for understanding their behaviour while navigating within
ODVs. A comprehensive overview of the research efforts aimed at advancing ODV streaming
systems is also presented. In particular, the state-of-the-art solutions under examination in this
chapter are distinguished in terms of system-centric and user-centric streaming approaches: the
former approach comes from a quite straightforward extension of well-established solutions for
the 2D video pipeline, while the latter benefits from understanding users' behaviour and enables more personalised ODV streaming.
Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer
Viewport prediction is a crucial aspect of tile-based 360 video streaming
systems. However, existing trajectory-based methods lack robustness and
oversimplify the construction and fusion of information across different
modality inputs, leading to error accumulation. In this
paper, we propose a tile classification based viewport prediction method with
Multi-modal Fusion Transformer, namely MFTR. Specifically, MFTR utilizes
transformer-based networks to extract the long-range dependencies within each
modality, then mine intra- and inter-modality relations to capture the combined
impact of user historical inputs and video contents on future viewport
selection. In addition, MFTR classifies future tiles into two categories,
user-interested or not, and selects the future viewport as the region
containing the most user-interested tiles. Compared with predicting head
trajectories, choosing the future viewport from tiles' binary classification
results exhibits better robustness and interpretability. To evaluate our
proposed MFTR, we conduct
extensive experiments on two widely used datasets, PVS-HM and Xu-Gaze. MFTR
shows superior performance over state-of-the-art methods in terms of average
prediction accuracy and overlap ratio, while also offering competitive
computational efficiency.
Comment: This paper is accepted by ACM-MM 202
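As an illustration of the tile-classification idea described above (this is not the MFTR implementation itself; the grid size, viewport dimensions, and wrap-around handling below are assumptions), the future viewport can be chosen as the window containing the most tiles predicted as user-interested:

```python
import numpy as np

def select_viewport(tile_probs, vp_rows=3, vp_cols=4, threshold=0.5):
    """Pick the viewport window containing the most user-interested tiles.

    tile_probs: 2D array of per-tile 'user interested' probabilities
    (rows x cols of the tiled equirectangular frame). The window slides
    with horizontal wrap-around, since the frame spans 360 degrees.
    """
    interested = (tile_probs >= threshold).astype(int)
    rows, cols = interested.shape
    best_count, best_pos = -1, (0, 0)
    for r in range(rows - vp_rows + 1):      # no vertical wrap
        for c in range(cols):                # horizontal wrap
            col_idx = [(c + dc) % cols for dc in range(vp_cols)]
            count = interested[r:r + vp_rows, :][:, col_idx].sum()
            if count > best_count:
                best_count, best_pos = count, (r, c)
    return best_pos, best_count
```

Working on binary tile labels rather than a continuous head trajectory is what gives this formulation its robustness: a small error in one tile's label shifts the chosen window by at most one tile, instead of accumulating along a predicted path.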
Network and Content Intelligence for 360 Degree Video Streaming Optimization
In recent years, 360° videos, a.k.a. spherical frames, have become popular
among users, creating an immersive streaming experience. Along with the
advances in smartphone and Head Mounted Display (HMD) technology, many content
providers have begun to host and stream 360° videos in both on-demand and live
streaming modes. Many different applications have therefore already arisen
leveraging these immersive videos, especially to give viewers an impression of
presence in a digital environment. For example, with 360° videos it is now
possible to connect people in a remote meeting in an interactive way, which
essentially increases the productivity of the meeting. Also, creating
interactive learning materials using 360° videos for students will help
deliver learning outcomes effectively.
However, streaming 360° videos is not an easy task, for several reasons. First,
360° video frames are 4–6 times larger than normal video frames at the same
perceived quality. Therefore, delivering these videos demands higher bandwidth
in the network. Second, processing relatively larger frames requires more
computational resources at the end devices, particularly for end-user devices
with limited resources. This impacts not only the delivery of 360° videos but
also many other applications running on shared resources. Third, these videos
need to be streamed with very low latency due to their interactive nature.
Inability to satisfy these requirements can result in poor Quality of
Experience (QoE) for the user. For example, insufficient bandwidth incurs
frequent rebuffering and poor video quality. Also, inadequate computational
capacity can cause faster battery draining and unnecessary heating of the
device, causing discomfort to the user. Motion or cybersickness will be
prevalent if there is unnecessary delay in streaming. These circumstances
hinder providing immersive streaming experiences to the communities that need
them most, especially those without sufficient network resources.
To address the above challenges, we believe that enhancements to the three
main components of the video streaming pipeline, namely the server, the
network, and the client, are essential. Starting with the network, it is
beneficial for network providers to identify 360° video flows as early as
possible and understand their behaviour in the network, so that they can
allocate sufficient resources for video delivery without compromising the
quality of other services. Content servers, at one end of this streaming
pipeline, require efficient 360° video frame processing mechanisms to support
adaptive streaming approaches such as ABR (Adaptive Bit Rate) streaming and
viewport-aware (VP-aware) streaming, a paradigm unique to 360° videos that
delivers only the part of the larger video frame that falls within the
user-visible region. At the other end, the client can be combined with
edge-assisted streaming to deliver 360° video content with reduced latency and
higher quality.
Following the above optimization strategies, in this thesis we first propose a
mechanism named 360NorVic to extract 360° video flows from encrypted video
traffic and analyze their traffic characteristics. We propose Machine Learning
(ML) models to classify 360° and normal videos under different scenarios such
as offline, near real-time, VP-aware streaming and Mobile Network Operator
(MNO) level streaming. Having extracted 360° video traffic traces at both
packet and flow level with high accuracy, we analyze the differences between
360° and normal video patterns in the encrypted traffic domain, which is
beneficial for effective resource optimization to enhance 360° video delivery.
Second, we present a WGAN (Wasserstein Generative Adversarial Network) based
data generation mechanism (namely VideoTrain++) to synthesize encrypted
network video traffic from minimal data. Leveraging synthetic data, we show
improved performance in 360° video traffic analysis, especially in ML-based
classification in 360NorVic. Third, we propose an effective 360° video frame
partitioning mechanism (namely VASTile) at the server side to support VP-aware
360° video streaming with dynamic tiles (or variable tiles) of different sizes
and locations on the frame. VASTile takes a visual attention map on the video
frames as input and applies a computational geometry approach to generate a
non-overlapping tile configuration that covers the video frames adaptively to
the visual attention. We present VASTile as a scalable approach for video
frame processing at the servers and a method to reduce bandwidth consumption
in network data transmission. Finally, by applying VASTile to the individual
user VP at the client side and utilizing the cache storage of Multi-access
Edge Computing (MEC) servers, we propose OpCASH, a mechanism to personalize
360° video streaming with dynamic tiles with edge assistance. By proposing an
ILP-based solution that effectively selects cached variable tiles from MEC
servers, which may not be identical to the VP tiles requested by the user but
still cover the same VP region, OpCASH maximizes cache utilization and reduces
the number of requests to content servers over the congested core network.
With this approach, we demonstrate gains in latency, bandwidth savings, and
video quality in personalized 360° video streaming.
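The tile-selection step at the heart of OpCASH is solved with an ILP in the thesis; as a rough sketch of the underlying idea (the names and data layout below are invented for illustration, and a greedy set-cover heuristic stands in for the actual ILP), cached variable tiles can be chosen to cover the requested VP region, with any cells left uncovered fetched from the content server:

```python
def cover_viewport(vp_cells, cached_tiles):
    """Greedy stand-in for OpCASH's ILP tile selection.

    vp_cells: set of (row, col) grid cells inside the user's viewport
    cached_tiles: dict mapping tile_id -> set of (row, col) cells that
    the cached variable tile covers (tiles need not match the VP tiles)

    Returns the chosen cached tiles and the cells still uncovered,
    which would have to be requested from the origin content server.
    """
    uncovered = set(vp_cells)
    chosen = []
    while uncovered and cached_tiles:
        # pick the cached tile covering the most still-uncovered cells
        tile_id, cells = max(cached_tiles.items(),
                             key=lambda kv: len(kv[1] & uncovered))
        gain = cells & uncovered
        if not gain:            # nothing in the cache helps any more
            break
        chosen.append(tile_id)
        uncovered -= gain
    return chosen, uncovered
```

Every cell served from a chosen cached tile is a request that never reaches the congested core network, which is the source of the latency and bandwidth gains reported above; the ILP additionally optimizes which overlapping cached tiles to prefer, which this greedy sketch only approximates.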
Machine Learning for Multimedia Communications
Machine learning is revolutionizing the way multimedia information is processed and transmitted to users. After intensive and powerful training, impressive efficiency/accuracy improvements have been achieved across the transmission pipeline. For example, the high model capacity of learning-based architectures enables us to accurately model image and video behavior such that tremendous compression gains can be achieved. Similarly, error concealment, streaming strategies and even user perception modeling have widely benefited from recent learning-oriented developments. However, learning-based algorithms often imply drastic changes to the way data are represented or consumed, meaning that the overall pipeline can be affected even though only a subpart of it is optimized. In this paper, we review the recent major advances that have been proposed across the transmission chain, and we discuss their potential impact and the research challenges that they raise.