VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement
Video to sound generation aims to generate realistic and natural sound given
a video input. However, previous video-to-sound generation methods can only
generate a random or average timbre, offering no control over or specialization
of the timbre of the generated sound, so users often cannot obtain the timbre
they desire. In this paper, we pose the
task of generating sound with a specific timbre given a video input and a
reference audio sample. To solve this task, we disentangle each target sound
audio into three components: temporal information, acoustic information, and
background information. We first use three encoders to encode these components
respectively: 1) a temporal encoder to encode temporal information, which is
fed with video frames since the input video shares the same temporal
information as the original audio; 2) an acoustic encoder to encode timbre
information, which takes the original audio as input and discards its temporal
information by a temporal-corrupting operation; and 3) a background encoder to
encode the residual or background sound, which uses the background part of the
original audio as input. To make the generated result achieve better quality
and temporal alignment, we also adopt a mel discriminator and a temporal
discriminator for the adversarial training. Our experimental results on the VAS
dataset demonstrate that our method can generate high-quality audio samples
with good synchronization with the events in the video and high timbre
similarity to the reference audio.
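A minimal PyTorch sketch of the three-branch disentanglement described in this abstract is given below. The module names, the recurrent encoders, and the frame-shuffling temporal corruption are illustrative assumptions rather than the paper's released architecture, and the mel and temporal discriminators used for adversarial training are omitted.

```python
import torch
import torch.nn as nn


class TimbreControllableV2S(nn.Module):
    # Illustrative sketch, not the authors' implementation.
    def __init__(self, video_dim=512, mel_bins=80, hidden=256):
        super().__init__()
        # 1) temporal encoder: video frames carry the event timing
        self.temporal_enc = nn.GRU(video_dim, hidden, batch_first=True)
        # 2) acoustic (timbre) encoder: reference audio with timing corrupted
        self.acoustic_enc = nn.GRU(mel_bins, hidden, batch_first=True)
        # 3) background encoder: residual/background part of the audio
        self.background_enc = nn.GRU(mel_bins, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden * 3, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, mel_bins)

    @staticmethod
    def corrupt_time(mel):
        # Temporal-corrupting operation: shuffle frames so that only
        # time-invariant (timbre) information survives.
        idx = torch.randperm(mel.size(1))
        return mel[:, idx]

    def forward(self, video_feats, ref_mel, bg_mel):
        t, _ = self.temporal_enc(video_feats)                 # (B, T, H)
        a, _ = self.acoustic_enc(self.corrupt_time(ref_mel))  # (B, T', H)
        b, _ = self.background_enc(bg_mel)                    # (B, T'', H)
        # Pool timbre/background over time, then broadcast them along the
        # temporal axis defined by the video.
        a = a.mean(dim=1, keepdim=True).expand_as(t)
        b = b.mean(dim=1, keepdim=True).expand_as(t)
        h, _ = self.decoder(torch.cat([t, a, b], dim=-1))
        return self.to_mel(h)                                 # predicted mel
```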
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Direct speech-to-speech translation (S2ST) with discrete self-supervised
representations has achieved remarkable accuracy, but is unable to preserve the
speaker timbre of the source speech during translation. Meanwhile, the scarcity
of high-quality speaker-parallel data poses a challenge for learning style
transfer between source and target speech. We propose an S2ST framework with an
acoustic language model based on discrete units from a self-supervised model
and a neural codec for style transfer. The acoustic language model leverages
self-supervised in-context learning, acquiring the ability for style transfer
without relying on any speaker-parallel data, thereby overcoming the issue of
data scarcity. By using extensive training data, our model achieves zero-shot
cross-lingual style transfer on previously unseen source languages. Experiments
show that our model generates translated speech with high fidelity and style
similarity. Audio samples are available at http://stylelm.github.io/.
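Below is a toy, self-contained sketch of the in-context style transfer idea described in this abstract: the acoustic language model is prompted with the reference speaker's units and codec tokens and then continues from the translated units. The ToyAcousticLM class, the greedy decoding loop, and the token interface are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn


class ToyAcousticLM(nn.Module):
    """Toy decoder-style LM over discrete tokens; a stand-in for the
    acoustic language model, purely for illustration."""
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):               # tokens: (B, L) long
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h)                  # (B, L, vocab) logits


def in_context_style_transfer(lm, ref_units, ref_codes, tgt_units, steps=50):
    # Prompt = reference units + reference codec codes + translated units;
    # the LM then continues with codec codes imitating the reference style.
    seq = torch.cat([ref_units, ref_codes, tgt_units], dim=1)
    generated = []
    for _ in range(steps):
        nxt = lm(seq)[:, -1].argmax(-1, keepdim=True)  # greedy next token
        generated.append(nxt)
        seq = torch.cat([seq, nxt], dim=1)
    return torch.cat(generated, dim=1)                 # generated codec codes
```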
Detector Guidance for Multi-Object Text-to-Image Generation
Diffusion models have demonstrated impressive performance in text-to-image
generation. They utilize a text encoder and cross-attention blocks to infuse
textual information into images at a pixel level. However, their capability to
generate images from text prompts containing multiple objects is still limited.
Previous works identify the problem of information mixing in the CLIP text
encoder and introduce the T5 text encoder or incorporate strong prior knowledge
to assist with the alignment. We find that mixing problems also occur on the
image side and in the cross-attention blocks. The noisy images can cause
different objects to appear similar, and the cross-attention blocks inject
information at a pixel level, leading to leakage of global object understanding
and resulting in object mixing. In this paper, we introduce Detector Guidance
(DG), which integrates a latent object detection model to separate different
objects during the generation process. DG first performs latent object
detection on cross-attention maps (CAMs) to obtain object information. Based on
this information, DG then masks conflicting prompts and enhances related
prompts by manipulating the following CAMs. We evaluate the effectiveness of DG
using Stable Diffusion on COCO, CC, and a novel multi-related object benchmark,
MRO. Human evaluations demonstrate that DG provides an 8-22% advantage in
preventing the amalgamation of conflicting concepts and in ensuring that each
object occupies its own distinct region, without any human involvement or
additional iterations. Our implementation is available at
https://github.com/luping-liu/Detector-Guidance.
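A rough sketch of the mask-and-enhance step on cross-attention maps is shown below. The latent object detection that produces the per-object regions and token groupings is not reproduced; the interface and the boost factor are hypothetical and only illustrate the idea of suppressing conflicting prompt tokens inside another object's region.

```python
import torch


def detector_guidance(cams, object_masks, token_groups, boost=1.5):
    """Illustrative CAM mask-and-enhance step.

    cams:         (B, HW, T) cross-attention maps from the diffusion U-Net
    object_masks: list of (B, HW) binary masks, one per detected object
    token_groups: list of token-index lists, the prompt tokens associated
                  with each object (a hypothetical interface)
    """
    guided = cams.clone()
    for i, (mask, tokens) in enumerate(zip(object_masks, token_groups)):
        region = mask.unsqueeze(-1)                     # (B, HW, 1)
        for j, other_tokens in enumerate(token_groups):
            if j == i:
                continue
            # Mask conflicting prompts: suppress the other objects' tokens
            # inside this object's region.
            guided[..., other_tokens] *= (1 - region)
        # Enhance related prompts: boost this object's tokens in its region.
        guided[..., tokens] *= 1 + (boost - 1) * region
    return guided
```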
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models
Stutter removal is an essential scenario in the field of speech editing.
However, when the speech recording contains stutters, existing text-based
speech editing approaches still suffer from: 1) the over-smoothing problem in
the edited speech; 2) a lack of robustness to the noise introduced by
stuttering; and 3) the need for users to manually determine the region to be
edited in order to remove the stutters. To tackle these challenges in stutter
removal, we propose
FluentSpeech, a stutter-oriented automatic speech editing model. Specifically,
1) we propose a context-aware diffusion model that iteratively refines the
modified mel-spectrogram with the guidance of context features; 2) we introduce
a stutter predictor module to inject the stutter information into the hidden
sequence; 3) we also propose a stutter-oriented automatic speech editing (SASE)
dataset that contains spontaneous speech recordings with time-aligned stutter
labels to train the automatic stutter localization model. Experimental results
on VCTK and LibriTTS datasets demonstrate that our model achieves
state-of-the-art performance on speech editing. Further experiments on our SASE
dataset show that FluentSpeech can effectively improve the fluency of
stuttering speech in terms of objective and subjective metrics. Code and audio
samples can be found at https://github.com/Zain-Jiang/Speech-Editing-Toolkit.
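The following toy sketch illustrates the combination of a stutter predictor with context-aware iterative refinement of the masked mel-spectrogram region, as described in this abstract. The GRU denoiser, the thresholded frame-level predictor, and the fixed number of refinement steps are simplifying assumptions and do not reflect the paper's actual diffusion model.

```python
import torch
import torch.nn as nn


class ToyContextDiffusionEditor(nn.Module):
    """Toy sketch: a stutter predictor flags frames to edit, and only that
    region is iteratively refined while the surrounding mel frames are kept
    fixed as context. Illustrative only."""
    def __init__(self, mel_bins=80, hidden=256, steps=4):
        super().__init__()
        self.steps = steps
        self.stutter_predictor = nn.Sequential(
            nn.Linear(mel_bins, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.denoiser = nn.GRU(mel_bins * 2, mel_bins, batch_first=True)

    def forward(self, mel):                               # mel: (B, T, 80)
        # Predict which frames are stuttered (1 = edit, 0 = keep).
        edit_mask = (torch.sigmoid(self.stutter_predictor(mel)) > 0.5).float()
        context = mel * (1 - edit_mask)                   # frozen context frames
        x = torch.randn_like(mel) * edit_mask + context   # noisy start in region
        for _ in range(self.steps):
            x, _ = self.denoiser(torch.cat([x, context], dim=-1))
            x = x * edit_mask + context                   # re-impose the context
        return x
```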
- …