Using TB-Sized Data to Understand Multi-Device Advertising
In this study, we combine conversion funnel theory with machine learning methods to understand multi-device advertising. We investigate the important question of how the distribution of ads across multiple devices affects the consumer path to purchase. To handle the sheer volume of TB-sized impression data, we develop a MapReduce framework to estimate a non-stationary Hidden Markov Model in parallel. To accommodate the iterative nature of the estimation procedure, we leverage the Apache Spark framework and a corporate cloud computing service. We calibrate the model with hundreds of millions of impressions for 100 advertisers. Our preliminary results show that increasing the diversity of devices used for ad delivery consistently encourages consumers to become more engaged. In addition, advertiser heterogeneity plays an important role in the variety of conversion processes.
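The abstract's parallel estimation idea can be sketched in plain Python: the per-user forward-algorithm likelihood is the "map" step, and summing log-likelihoods is the associative "reduce" step that Spark parallelizes. All parameters below are illustrative toy values, not the paper's; `forward_loglik` is a hypothetical helper name.

```python
import math
from functools import reduce

# Toy non-stationary HMM: the transition matrix varies by time step.
# Two hidden engagement states, binary observations (ad clicked or not).
INIT = [0.6, 0.4]                      # initial state distribution
TRANS = {0: [[0.7, 0.3], [0.2, 0.8]],  # transitions from t=0 to t=1
         1: [[0.5, 0.5], [0.4, 0.6]]}  # transitions from t=1 to t=2
EMIT = [[0.9, 0.1], [0.2, 0.8]]        # P(obs | state), obs in {0, 1}

def forward_loglik(obs):
    """Map step: forward algorithm over one user's impression sequence."""
    alpha = [INIT[s] * EMIT[s][obs[0]] for s in range(2)]
    for t in range(1, len(obs)):
        trans = TRANS[t - 1]           # non-stationary: pick matrix for t
        alpha = [sum(alpha[i] * trans[i][s] for i in range(2)) * EMIT[s][obs[t]]
                 for s in range(2)]
    return math.log(sum(alpha))

# Reduce step: total log-likelihood across users. Summation is
# associative, so it distributes cleanly under MapReduce/Spark.
users = [[0, 0, 1], [1, 1, 1], [0, 1, 0]]
total = reduce(lambda a, b: a + b, map(forward_loglik, users))
```

In a real Spark job the `map`/`reduce` pair would become `rdd.map(forward_loglik).reduce(add)` over the full impression log, re-run at each EM iteration.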
Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information
Volumetric video, also known as hologram video, is a novel medium that
portrays natural content in Virtual Reality (VR), Augmented Reality (AR), and
Mixed Reality (MR). It is expected to be the next-gen video technology and a
prevalent use case for 5G and beyond wireless communication. Considering that
each user typically only watches a section of the volumetric video, known as
the viewport, it is essential to have precise viewport prediction for optimal
performance. However, research on this topic is still in its infancy. To this
end, this paper proposes a novel approach, named Saliency and
Trajectory Viewport Prediction (STVP), which aims to improve the precision of
viewport prediction in volumetric video streaming. STVP extensively
utilizes video saliency information and viewport trajectories. To our knowledge,
this is the first comprehensive study of viewport prediction in volumetric
video streaming. In particular, we introduce a novel sampling method, Uniform
Random Sampling (URS), to reduce computational complexity while still
preserving video features in an efficient manner. Then we present a saliency
detection technique that incorporates both spatial and temporal information for
detecting static, dynamic geometric, and color salient regions. Finally, we
intelligently fuse saliency and trajectory information to achieve more accurate
viewport prediction. We conduct extensive simulations to evaluate the
effectiveness of our proposed viewport prediction methods using
state-of-the-art volumetric video sequences. The experimental results show the
superiority of the proposed method over existing schemes. The dataset and
source code will be made publicly available upon acceptance.
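The sampling step described above can be illustrated generically: uniform random sampling keeps a fixed fraction of a point-cloud frame so downstream saliency detection runs on fewer points. This is a minimal sketch of plain uniform sampling; the paper's URS variant and any feature-preserving details may differ, and `uniform_random_sample` is a hypothetical helper name.

```python
import random

def uniform_random_sample(points, keep_ratio, seed=None):
    """Downsample a point-cloud frame by keeping a uniform random subset.

    points: list of (x, y, z) tuples; keep_ratio: fraction in (0, 1].
    """
    rng = random.Random(seed)          # seeded for reproducible sampling
    k = max(1, int(len(points) * keep_ratio))
    return rng.sample(points, k)

# Example: keep 10% of a 10,000-point frame.
frame = [(i * 0.1, i * 0.2, i * 0.3) for i in range(10_000)]
subset = uniform_random_sample(frame, 0.1, seed=0)
```

Because every point is equally likely to survive, the subset's spatial distribution tracks the original frame's, which is what lets a lightweight sample stand in for the full cloud.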
3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing
The current GAN inversion methods typically can only edit the appearance and
shape of a single object and background while overlooking spatial information.
In this work, we propose a 3D editing framework, 3D-GOI, to enable multifaceted
editing of affine information (scale, translation, and rotation) on multiple
objects. 3D-GOI realizes the complex editing function by inverting the
abundance of attribute codes (object
shape/appearance/scale/rotation/translation, background shape/appearance, and
camera pose) controlled by GIRAFFE, a renowned 3D GAN. Accurately inverting all
the codes is challenging; 3D-GOI addresses this challenge in three main
steps. First, we segment the objects and the background in a multi-object
image. Second, we use a custom Neural Inversion Encoder to obtain coarse codes
of each object. Finally, we use a round-robin optimization algorithm to get
precise codes to reconstruct the image. To the best of our knowledge, 3D-GOI is
the first framework to enable multifaceted editing on multiple objects. Both
qualitative and quantitative experiments demonstrate that 3D-GOI holds immense
potential for flexible, multifaceted editing in complex multi-object scenes.
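The round-robin refinement in the third step can be sketched as coordinate descent: optimize one code group at a time while holding the others fixed, cycling through the groups. Everything below is a toy stand-in, assuming a generic scalar loss; the paper's objective, optimizer, and GIRAFFE codes are far more elaborate.

```python
def round_robin_optimize(codes, loss, steps=50, lr=0.1, eps=1e-4):
    """Refine each code group in turn while the other groups stay fixed.

    codes: dict mapping group name -> list of floats.
    loss:  callable taking the codes dict, returning a scalar to minimize.
    """
    names = list(codes)
    for _ in range(steps):
        for name in names:                       # round-robin over groups
            grad = []
            for i, v in enumerate(codes[name]):  # finite-difference gradient
                bumped = dict(codes)
                bumped[name] = codes[name][:i] + [v + eps] + codes[name][i + 1:]
                grad.append((loss(bumped) - loss(codes)) / eps)
            codes[name] = [v - lr * g for v, g in zip(codes[name], grad)]
    return codes

# Toy "reconstruction" loss: squared distance of each group to a target.
targets = {"scale": [1.0], "translation": [0.3, -0.2], "rotation": [0.5]}
loss = lambda c: sum((v - t) ** 2 for k in c for v, t in zip(c[k], targets[k]))
codes = round_robin_optimize({k: [0.0] * len(v) for k, v in targets.items()}, loss)
```

In the actual framework the loss would compare the GIRAFFE rendering against the input image, and each group would be updated with a gradient-based optimizer rather than finite differences.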
Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR
Self-supervised pre-training could effectively improve the performance of
low-resource automatic speech recognition (ASR). However, existing
self-supervised pre-training methods are task-agnostic, i.e., they can be
applied to various downstream tasks. Although this broadens their scope of application,
the capacity of the pre-trained model is not fully utilized for the ASR task,
and the learned representations may not be optimal for ASR. In this work, in
order to build a better pre-trained model for low-resource ASR, we propose a
pre-training approach called wav2vec-S, where we use task-specific
semi-supervised pre-training to refine the self-supervised pre-trained model
for the ASR task, thus more effectively utilizing the capacity of the pre-trained
model to generate task-specific representations for ASR. Experiments show that
compared to wav2vec 2.0, wav2vec-S requires only a marginal increase in
pre-training time but significantly improves ASR performance on in-domain,
cross-domain and cross-lingual datasets. Average relative WER reductions are
24.5% and 6.6% for 1h and 10h fine-tuning, respectively. Furthermore, we show
that semi-supervised pre-training could close the representation gap between
the self-supervised pre-trained model and the corresponding fine-tuned model
through canonical correlation analysis.
Comment: Accepted by Interspeech 202
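The "relative WER reduction" metric quoted above is simple arithmetic worth making explicit: it is the drop in word error rate expressed as a fraction of the baseline. The function name and the example numbers below are illustrative, not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical illustration: a 24.5% relative reduction from a 20.0%
# baseline WER corresponds to an absolute WER of about 15.1%.
example = relative_wer_reduction(20.0, 15.1)
```

Note the distinction from the absolute reduction (here 4.9 WER points): relative figures let improvements be averaged across datasets with very different baseline error rates.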
Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder
Neural networks have been able to generate high-quality single-sentence
speech with substantial expressiveness. However, paragraph-level speech
synthesis remains a challenge, as it requires coherent acoustic features
while delivering fluctuating speech styles. Meanwhile, training these models
directly on overly long speech leads to a deterioration in the quality of the
synthesized speech. To address these problems, we propose a
high-quality and expressive paragraph speech synthesis system with a multi-step
variational autoencoder. Specifically, we employ multi-step latent variables to
capture speech information at different grammatical levels before utilizing
these features in parallel to generate the speech waveform. We also propose a
three-step training method to improve the decoupling ability. Our model was
trained on a single-speaker French audiobook corpus released at Blizzard
Challenge 2023. Experimental results underscore the significant superiority of
our system over baseline models.
Comment: 5 pages, 1 figure, 2 tables
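The multi-step latent idea above — latent variables at several grammatical levels, used in parallel — can be sketched with a toy sampler: one paragraph-level latent shared by all words, one latent per sentence, and one per word, concatenated into each word's conditioning vector. This is a structural illustration only, assuming Gaussian latents and a hypothetical `sample_latents` helper; the paper's VAE learns these distributions rather than sampling fixed ones.

```python
import random

def sample_latents(n_sentences, words_per_sentence, dim=4, seed=0):
    """Combine paragraph-, sentence-, and word-level latents per word."""
    rng = random.Random(seed)
    z = lambda: [rng.gauss(0, 1) for _ in range(dim)]
    z_par = z()                                  # one paragraph-level latent
    z_sent = [z() for _ in range(n_sentences)]   # one latent per sentence
    feats = []
    for s in range(n_sentences):
        for _ in range(words_per_sentence):
            z_word = z()                         # one latent per word
            # Parallel fusion: each word sees all three granularities.
            feats.append(z_par + z_sent[s] + z_word)
    return feats

feats = sample_latents(2, 3)  # 2 sentences x 3 words -> 6 feature vectors
```

Sharing the coarse latents across their spans is what enforces paragraph-level coherence, while the per-word latents carry the fluctuating local style.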