VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement
Video-to-sound generation aims to generate realistic and natural sound given
a video input. However, previous video-to-sound generation methods can only
generate a random or average timbre, with no control over or specialization of
the generated sound's timbre, so users sometimes cannot obtain the timbre they
want. In this paper, we pose the
task of generating sound with a specific timbre given a video input and a
reference audio sample. To solve this task, we disentangle each target audio
clip into three components: temporal information, acoustic information, and
background information. We first use three encoders to encode these components
respectively: 1) a temporal encoder to encode temporal information, which is
fed with video frames since the input video shares the same temporal
information as the original audio; 2) an acoustic encoder to encode timbre
information, which takes the original audio as input and discards its temporal
information by a temporal-corrupting operation; and 3) a background encoder to
encode the residual or background sound, which uses the background part of the
original audio as input. To make the generated result achieve better quality
and temporal alignment, we also adopt a mel discriminator and a temporal
discriminator for the adversarial training. Our experimental results on the VAS
dataset demonstrate that our method can generate high-quality audio samples
with good synchronization with events in video and high timbre similarity with
the reference audio.
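The VarietySound abstract above says the acoustic encoder discards temporal information through a temporal-corrupting operation, but does not say how that operation works. The sketch below shows one possible form, shuffling fixed-length segments of a frame sequence. The function name `corrupt_temporal_order`, the `segment_len` parameter, and the shuffling strategy itself are assumptions made for illustration, not the authors' method.

```python
import random


def corrupt_temporal_order(frames, segment_len, seed=0):
    """Shuffle fixed-length segments of a frame sequence.

    Hypothetical illustration only: segment shuffling is an assumed
    stand-in for the unspecified temporal-corrupting operation.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    # Cut the sequence into consecutive segments (the last may be shorter).
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames), segment_len)]
    # Reorder whole segments so the global event order is destroyed,
    # while the frames inside each segment stay intact.
    rng = random.Random(seed)
    rng.shuffle(segments)
    return [frame for segment in segments for frame in segment]


frames = list(range(10))          # stand-in for 10 spectrogram frames
corrupted = corrupt_temporal_order(frames, segment_len=3)
```

Under this assumption the output contains exactly the same frames as the input, so frame-level content is kept while the original ordering of events is no longer recoverable from it.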
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Direct speech-to-speech translation (S2ST) with discrete self-supervised
representations has achieved remarkable accuracy, but is unable to preserve the
speaker timbre of the source speech during translation. Meanwhile, the scarcity
of high-quality speaker-parallel data poses a challenge for learning style
transfer between source and target speech. We propose an S2ST framework with an
acoustic language model based on discrete units from a self-supervised model
and a neural codec for style transfer. The acoustic language model leverages
self-supervised in-context learning, acquiring the ability for style transfer
without relying on any speaker-parallel data, thereby overcoming the issue of
data scarcity. By using extensive training data, our model achieves zero-shot
cross-lingual style transfer on previously unseen source languages. Experiments
show that our model generates translated speech with high fidelity and style
similarity. Audio samples are available at http://stylelm.github.io/.
Comment: 5 pages, 1 figure. Submitted to ICASSP 202
Detector Guidance for Multi-Object Text-to-Image Generation
Diffusion models have demonstrated impressive performance in text-to-image
generation. They utilize a text encoder and cross-attention blocks to infuse
textual information into images at a pixel level. However, their capability to
generate images with text containing multiple objects is still restricted.
Previous works identify the problem of information mixing in the CLIP text
encoder and introduce the T5 text encoder or incorporate strong prior knowledge
to assist with the alignment. We find that mixing problems also occur on the
image side and in the cross-attention blocks. The noisy images can cause
different objects to appear similar, and the cross-attention blocks inject
information at a pixel level, leading to leakage of global object understanding
and resulting in object mixing. In this paper, we introduce Detector Guidance
(DG), which integrates a latent object detection model to separate different
objects during the generation process. DG first performs latent object
detection on cross-attention maps (CAMs) to obtain object information. Based on
this information, DG then masks conflicting prompts and enhances related
prompts by manipulating the following CAMs. We evaluate the effectiveness of DG
using Stable Diffusion on COCO, CC, and a novel multi-related object benchmark,
MRO. Human evaluations demonstrate that DG provides an 8-22% advantage in
preventing the amalgamation of conflicting concepts and ensuring that each
object possesses its unique region without any human involvement or additional
iterations. Our implementation is available at
https://github.com/luping-liu/Detector-Guidance
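The Detector Guidance abstract above describes masking conflicting prompts and enhancing related prompts on cross-attention maps (CAMs). The sketch below shows that idea on toy data, assuming each object token already has a binary region mask from some detection step. The function `guide_attention`, the `boost` factor, and the nested-list representation are hypothetical and are not taken from the authors' implementation.

```python
def guide_attention(cams, regions, boost=2.0):
    """Mask conflicting tokens and boost the owning token per pixel.

    cams:    dict mapping token -> 2D list of attention weights
    regions: dict mapping object token -> 2D list of 0/1 region flags
    Hypothetical sketch; not the authors' implementation.
    """
    guided = {}
    for token, cam in cams.items():
        if token not in regions:
            # Tokens without a detected region are left unchanged.
            guided[token] = [row[:] for row in cam]
            continue
        new_cam = []
        for y, row in enumerate(cam):
            new_row = []
            for x, weight in enumerate(row):
                if regions[token][y][x]:
                    # Pixel belongs to this object: enhance its prompt.
                    new_row.append(weight * boost)
                elif any(regions[other][y][x]
                         for other in regions if other != token):
                    # Pixel belongs to another object: mask this prompt.
                    new_row.append(0.0)
                else:
                    new_row.append(weight)
            new_cam.append(new_row)
        guided[token] = new_cam
    return guided


cams = {"cat": [[0.5, 0.5]], "dog": [[0.5, 0.5]], "a": [[0.1, 0.1]]}
regions = {"cat": [[1, 0]], "dog": [[0, 1]]}
guided = guide_attention(cams, regions)
```

In this toy case each pixel ends up attended by only one object token, which is the kind of separation the abstract describes. A real pipeline would apply such a step to the model's attention tensors during sampling.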
Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers
Recent research has demonstrated the significant potential of Large Language
Models (LLMs) in handling challenging tasks within 3D scenes. However, current
models are constrained to addressing object-centric tasks, where each
question-answer pair focuses solely on an individual object. In real-world
applications, users may pose queries involving multiple objects or expect
answers that precisely reference various objects. We introduce the use of
object identifiers to freely reference objects during a conversation. While
this solution appears straightforward, it presents two main challenges: 1) How
to establish a reliable one-to-one correspondence between each object and its
identifier? 2) How to incorporate complex spatial relationships among dozens of
objects into the embedding space of the LLM? To address these challenges, we
propose a two-stage alignment method, which involves learning an
attribute-aware token and a relation-aware token for each object. These tokens
capture the object's attributes and spatial relationships with surrounding
objects in the 3D scene. Once the alignment is established, we can fine-tune
our model on various downstream tasks using instruction tuning. Experiments
conducted on traditional datasets like ScanQA, ScanRefer, and Nr3D/Sr3D
showcase the effectiveness of our proposed method. Additionally, we create a 3D
scene captioning dataset annotated with rich object identifiers, with the
assistance of GPT-4. This dataset aims to further explore the capability of
object identifiers in effective object referencing and precise scene
understanding.
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models
Stutter removal is an essential scenario in the field of speech editing.
However, when the speech recording contains stutters, the existing text-based
speech editing approaches still suffer from: 1) the over-smoothing problem in
the edited speech; 2) lack of robustness due to the noise introduced by
stutters; 3) the need for users to manually determine the region to edit in
order to remove the stutters. To tackle the challenges in stutter removal, we propose
FluentSpeech, a stutter-oriented automatic speech editing model. Specifically,
1) we propose a context-aware diffusion model that iteratively refines the
modified mel-spectrogram with the guidance of context features; 2) we introduce
a stutter predictor module to inject the stutter information into the hidden
sequence; 3) we also propose a stutter-oriented automatic speech editing (SASE)
dataset that contains spontaneous speech recordings with time-aligned stutter
labels to train the automatic stutter localization model. Experimental results
on VCTK and LibriTTS datasets demonstrate that our model achieves
state-of-the-art performance on speech editing. Further experiments on our SASE
dataset show that FluentSpeech can effectively improve the fluency of
stuttering speech in terms of objective and subjective metrics. Code and audio
samples can be found at https://github.com/Zain-Jiang/Speech-Editing-Toolkit.
Comment: Accepted by ACL 2023 (Findings