UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data
We propose UnitSpeech, a speaker-adaptive speech synthesis method that
fine-tunes a diffusion-based text-to-speech (TTS) model using minimal
untranscribed data. To achieve this, we use the self-supervised unit
representation as a pseudo transcript and integrate the unit encoder into the
pre-trained TTS model. We train the unit encoder to provide speech content to
the diffusion-based decoder and then fine-tune the decoder for speaker
adaptation to the reference speaker using a single <unit, speech> pair.
UnitSpeech performs speech synthesis tasks such as TTS and voice conversion
(VC) in a personalized manner without requiring model re-training for each
task. UnitSpeech achieves results comparable or superior to previous baselines on
personalized TTS and any-to-any VC tasks. Our model also shows broad adaptive
performance on real-world data and on other tasks that use a unit sequence as
input. Comment: INTERSPEECH 2023, Oral
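As a rough illustration of the adaptation step described above, the sketch below fine-tunes only a diffusion decoder on a single <unit, speech> pair while a frozen unit encoder supplies the content conditioning. All modules, the noise schedule, and the data here are simplified toy stand-ins, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUnitEncoder(nn.Module):
    """Stand-in for the unit encoder: maps discrete units to content embeddings."""
    def __init__(self, n_units=100, dim=80):
        super().__init__()
        self.emb = nn.Embedding(n_units, dim)
    def forward(self, units):                # units: (B, T) long
        return self.emb(units)               # (B, T, dim)

class ToyDiffusionDecoder(nn.Module):
    """Stand-in for the diffusion decoder: predicts noise from noisy mel + condition."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))
    def forward(self, noisy_mel, t, cond):   # (B, T, dim), (B,), (B, T, dim)
        t_feat = t[:, None, None].expand(-1, noisy_mel.size(1), 1)
        return self.net(torch.cat([noisy_mel, cond, t_feat], dim=-1))

def q_sample(mel, t, noise):
    """Toy variance-preserving forward diffusion: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    alpha = (1.0 - t)[:, None, None]
    return alpha.sqrt() * mel + (1 - alpha).sqrt() * noise

def adapt_to_speaker(decoder, unit_encoder, units, ref_mel, steps=200, lr=2e-5):
    unit_encoder.requires_grad_(False)       # content encoder stays frozen
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            cond = unit_encoder(units)       # speech content conditioning
        t = torch.rand(ref_mel.size(0))
        noise = torch.randn_like(ref_mel)
        pred = decoder(q_sample(ref_mel, t, noise), t, cond)
        loss = F.mse_loss(pred, noise)       # standard noise-prediction loss
        opt.zero_grad(); loss.backward(); opt.step()
    return decoder

# Usage with a toy <unit, speech> pair (random stand-ins for real data):
units = torch.randint(0, 100, (1, 120))
ref_mel = torch.randn(1, 120, 80)
adapt_to_speaker(ToyDiffusionDecoder(), ToyUnitEncoder(), units, ref_mel)
```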
Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
Sound can convey significant information for spatial reasoning in our daily
lives. To endow deep networks with such ability, we address the challenge of
dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge
distillation. In this work, we propose a Spatial Alignment via Matching (SAM)
distillation framework that elicits local correspondence between the two
modalities in vision-to-audio knowledge transfer. SAM integrates audio features
with visually coherent learnable spatial embeddings to resolve inconsistencies
in multiple layers of a student model. Our approach does not rely on a specific
input representation, allowing for flexibility in the input shapes or
dimensions without performance degradation. With a newly curated benchmark
named Dense Auditory Prediction of Surroundings (DAPS), we are the first to
tackle dense indoor prediction of omnidirectional surroundings in both 2D and
3D with audio observations. Specifically, for audio-based depth estimation,
semantic segmentation, and challenging 3D scene reconstruction, the proposed
distillation framework consistently achieves state-of-the-art performance
across various metrics and backbone architectures. Comment: Published at ICCV 2023
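The sketch below gives a generic, hedged approximation of the cross-modal distillation idea described above: an audio student's intermediate features are combined with learnable spatial embeddings and matched against a frozen visual teacher's feature grid before a per-layer mimicking loss. The class and function names are illustrative placeholders, and the exact matching mechanism of SAM is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAlignmentLayer(nn.Module):
    """Aligns audio features to a teacher's spatial token grid via attention."""
    def __init__(self, dim, n_tokens):
        super().__init__()
        self.spatial_emb = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, audio_feat):            # audio_feat: (B, T_audio, dim)
        B = audio_feat.size(0)
        queries = self.spatial_emb.expand(B, -1, -1)   # one query per spatial token
        aligned, _ = self.attn(queries, audio_feat, audio_feat)
        return aligned                                  # (B, n_tokens, dim)

def distillation_loss(aligners, audio_feats, teacher_feats):
    """Sum of per-layer losses between aligned audio features and visual teacher features."""
    loss = 0.0
    for align, a, v in zip(aligners, audio_feats, teacher_feats):
        loss = loss + F.mse_loss(align(a), v.detach())  # teacher is frozen
    return loss

# Toy usage: two student/teacher layers, 64-dim features, a 16x16 = 256 token grid.
aligners = nn.ModuleList([SpatialAlignmentLayer(64, 256) for _ in range(2)])
audio_feats = [torch.randn(2, 50, 64) for _ in range(2)]     # audio student features
teacher_feats = [torch.randn(2, 256, 64) for _ in range(2)]  # visual teacher features
print(distillation_loss(aligners, audio_feats, teacher_feats))
```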
Panoramic Vision Transformer for Saliency Detection in 360° Videos
360° video saliency detection is one of the challenging benchmarks for
360° video understanding, since non-negligible distortion and
discontinuity occur in the projection of any format of 360° videos, and a
capture-worthy viewpoint in the omnidirectional sphere is ambiguous by nature.
We present a new framework named Panoramic Vision Transformer (PAVER). We
design the encoder using Vision Transformer with deformable convolution, which
enables us not only to plug pretrained models from normal videos into our
architecture without additional modules or finetuning but also to perform
geometric approximation only once, unlike previous deep CNN-based approaches.
Thanks to its powerful encoder, PAVER can learn saliency from three simple
relative relations among local patch features, outperforming state-of-the-art
models on the Wild360 benchmark by large margins without supervision or
auxiliary information like class activation. We demonstrate the utility of our
saliency prediction model with the omnidirectional video quality assessment
task in VQA-ODV, where we consistently improve performance without any form of
supervision, including head movement. Comment: Published at ECCV 2022
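To make the "relative relations among local patch features" idea above more concrete, here is a hedged toy sketch that scores each patch by its dissimilarity to local, frame-global, and temporal context features. It is only loosely inspired by the description; the deformable-convolution encoder and PAVER's actual scoring are not reproduced, and the feature tensor is a random stand-in.

```python
import torch
import torch.nn.functional as F

def relational_saliency(feats):
    """feats: (T, H, W, C) patch features of a clip. Returns a (T, H, W) saliency map
    built from three simple relative relations."""
    T, H, W, C = feats.shape
    x = F.normalize(feats, dim=-1)

    # 1) local relation: dissimilarity to the mean of the 3x3 spatial neighbourhood
    grid = x.permute(0, 3, 1, 2)                            # (T, C, H, W)
    local_ctx = F.avg_pool2d(grid, 3, stride=1, padding=1)
    local = 1 - (grid * F.normalize(local_ctx, dim=1)).sum(1)

    # 2) global relation: dissimilarity to the frame-level mean feature
    global_ctx = F.normalize(x.mean(dim=(1, 2), keepdim=True), dim=-1)
    glob = 1 - (x * global_ctx).sum(-1)

    # 3) temporal relation: dissimilarity to the clip-level temporal mean
    temporal_ctx = F.normalize(x.mean(dim=0, keepdim=True), dim=-1)
    temp = 1 - (x * temporal_ctx).sum(-1)

    return (local + glob + temp) / 3.0                      # (T, H, W)

saliency = relational_saliency(torch.randn(8, 14, 14, 768)) # toy ViT-like patch grid
print(saliency.shape)
```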
Stability of the Standard Model vacuum with respect to vacuum tunneling to the Komatsu vacuum in the cMSSM
We investigate the stability of the Standard Model vacuum with respect to
vacuum tunneling to the Komatsu vacuum, which exists when , in the cMSSM. Employing the numerical tools SARAH, SPheno and
CosmoTransitions, we scan and constrain the parameter space of the cMSSM up to
10 TeV. Regions excluded due to having a vacuum tunneling half-life less than
the age of the observable universe are concentrated near the regions where the
Standard Model vacuum is tachyonic and are more stringent at smaller ,
larger and negative , and larger . New excluded regions, which
satisfy , are found. Comment: 14 pages, 4 figures
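As a minimal numerical sketch of the kind of longevity requirement behind such exclusions, the snippet below checks whether a given bounce action keeps the expected number of bubble nucleations in our past light cone below one, using the standard estimate N ~ (M t_U)^4 exp(-S_E). The mass scale M and the action values are illustrative assumptions only; the actual scan in the paper is carried out with SARAH, SPheno, and CosmoTransitions.

```python
import math

def is_long_lived(S_E, M_GeV=1e3, t_universe_s=4.35e17):
    """Return True if the tunneling rate implies fewer than one expected
    bubble nucleation over the age of the observable universe."""
    GeV_inv_per_s = 1.52e24                  # 1 second in natural units (GeV^-1)
    t_U = t_universe_s * GeV_inv_per_s       # age of the universe in GeV^-1
    log_N = 4 * math.log(M_GeV * t_U) - S_E  # log of the expected nucleation count
    return log_N < 0

# For M ~ 1 TeV this reproduces the familiar S_E >~ O(400) threshold:
for S in (350, 400, 450):
    print(S, is_long_lived(S))
```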
DyAnNet: A Scene Dynamicity Guided Self-Trained Video Anomaly Detection Network
Unsupervised approaches for video anomaly detection may not perform as well
as supervised approaches. However, learning unknown types of anomalies with an
unsupervised approach is more practical than a supervised one, since
annotation is an extra burden. In this paper, we use isolation tree-based
unsupervised clustering to partition the deep feature space of the video
segments. The RGB stream generates a pseudo anomaly score and the flow stream
generates a pseudo dynamicity score of a video segment. These scores are then
fused using a majority voting scheme to generate preliminary bags of positive
and negative segments. However, these bags may not be accurate, as the scores
are generated using only the current segment, which does not capture the
global behavior of a typical anomalous event. We then use a refinement strategy
based on a cross-branch feed-forward network designed using a popular I3D
network to refine both scores. The bags are then refined through a segment
re-mapping strategy. The intuition behind combining the dynamicity score of a
segment with its anomaly score is to strengthen the quality of the evidence. The method
has been evaluated on three popular video anomaly datasets, i.e., UCF-Crime,
CCTV-Fights, and UBI-Fights. Experimental results reveal that the proposed
framework achieves accuracy competitive with state-of-the-art
video anomaly detection methods. Comment: 10 pages, 8 figures, and 4 tables. (Accepted at WACV 2023)
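The hedged sketch below illustrates the pseudo-labelling step described above: isolation forests score deep RGB and flow features of video segments, the two scores are fused by voting, and segments are split into preliminary positive and negative bags. Feature extraction and the I3D-based cross-branch refinement are omitted, the voting rule is simplified, and the features are random stand-ins.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def pseudo_bags(rgb_feats, flow_feats, vote_threshold=2):
    """rgb_feats, flow_feats: (n_segments, dim) deep features of video segments."""
    # Pseudo anomaly score from the RGB stream, pseudo dynamicity score from flow
    # (IsolationForest.score_samples is higher for inliers, so negate it).
    anomaly = -IsolationForest(random_state=0).fit(rgb_feats).score_samples(rgb_feats)
    dynamicity = -IsolationForest(random_state=0).fit(flow_feats).score_samples(flow_feats)

    # Simple voting: a segment joins the positive bag only if both streams flag it.
    votes = (anomaly > np.median(anomaly)).astype(int) + \
            (dynamicity > np.median(dynamicity)).astype(int)
    positive = np.where(votes >= vote_threshold)[0]    # candidate anomalous segments
    negative = np.where(votes < vote_threshold)[0]     # candidate normal segments
    return positive, negative

pos, neg = pseudo_bags(np.random.randn(64, 1024), np.random.randn(64, 1024))
print(len(pos), len(neg))
```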
Edit-A-Video: Single Video Editing with Object-Aware Consistency
Although text-to-video (TTV) models have recently achieved
remarkable success, few approaches have extended TTV to
video editing. Motivated by TTV approaches that adapt
diffusion-based text-to-image (TTI) models, we suggest a video editing
framework given only a pretrained TTI model and a single <text, video> pair,
which we term Edit-A-Video. The framework consists of two stages: (1) inflating
the 2D model into a 3D model by appending temporal modules and tuning on the
source video, and (2) inverting the source video into noise and editing it with the
target text prompt and attention map injection. Each stage enables the temporal
modeling and preservation of semantic attributes of the source video. One of
the key challenges of video editing is background inconsistency,
where regions not targeted by the edit suffer from undesirable
and inconsistent temporal alterations. To mitigate this issue, we also
introduce a novel mask blending method, termed sparse-causal blending (SC
Blending). We improve on previous mask blending methods to reflect temporal
consistency, so that the edited area exhibits smooth transitions while the
unedited regions remain spatio-temporally consistent.
We present extensive experimental results over various types of text
and videos, and demonstrate the superiority of the proposed method compared to
baselines in terms of background consistency, text alignment, and video editing
quality.
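The snippet below is a hedged sketch of mask-based latent blending with a sparse-causal twist, in the spirit of the SC Blending idea described above: each frame's editing mask is combined with the masks of the first and the immediately previous frame before blending edited and source latents, so the unedited background stays temporally consistent. Mask derivation (e.g., from attention maps) and the diffusion pipeline itself are omitted, and the tensors are random stand-ins rather than the authors' method.

```python
import torch

def sparse_causal_blend(edited, source, masks):
    """edited, source: (T, C, H, W) latents; masks: (T, 1, H, W) in [0, 1],
    with 1 marking regions that should be edited."""
    T = edited.size(0)
    out = torch.empty_like(edited)
    for t in range(T):
        # Sparse-causal mask: current frame OR first frame OR previous frame.
        m = masks[t]
        m = torch.maximum(m, masks[0])
        if t > 0:
            m = torch.maximum(m, masks[t - 1])
        # Edited content inside the mask, inverted source latent outside it.
        out[t] = m * edited[t] + (1 - m) * source[t]
    return out

blended = sparse_causal_blend(torch.randn(8, 4, 64, 64),
                              torch.randn(8, 4, 64, 64),
                              torch.rand(8, 1, 64, 64).round())
print(blended.shape)
```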