91 research outputs found
VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement
Video-to-sound generation aims to produce realistic and natural sound for a
given video input. However, previous video-to-sound generation methods can
only generate a random or average timbre, offering no control over or
specialization of the generated timbre, so users sometimes cannot obtain the
sound they desire. In this paper, we pose the
task of generating sound with a specific timbre given a video input and a
reference audio sample. To solve this task, we disentangle each target sound
audio into three components: temporal information, acoustic information, and
background information. We first use three encoders to encode these components
respectively: 1) a temporal encoder to encode temporal information, which is
fed with video frames since the input video shares the same temporal
information as the original audio; 2) an acoustic encoder to encode timbre
information, which takes the original audio as input and discards its temporal
information by a temporal-corrupting operation; and 3) a background encoder to
encode the residual or background sound, which uses the background part of the
original audio as input. To make the generated result achieve better quality
and temporal alignment, we also adopt a mel discriminator and a temporal
discriminator for adversarial training. Experimental results on the VAS
dataset demonstrate that our method generates high-quality audio samples that
are well synchronized with the events in the video and exhibit high timbre
similarity to the reference audio.
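The abstract's acoustic encoder relies on a "temporal-corrupting operation" to discard temporal information while keeping timbre. The paper does not detail the operation here; a minimal sketch, assuming the corruption is a random permutation of mel-spectrogram time frames (which destroys event timing but exactly preserves frame-wise timbre statistics):

```python
import numpy as np

def temporally_corrupt(mel, rng=None):
    """Destroy the temporal structure of a mel-spectrogram of shape
    [n_mels, T] by randomly permuting its time frames. The set of frames
    is unchanged, so timbre statistics (e.g. the mean spectrum over time)
    are preserved exactly while event timing is lost."""
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(mel.shape[1])  # random reordering of frame indices
    return mel[:, perm]
```

Feeding the acoustic encoder such shuffled frames forces it to describe *what* the sound is like rather than *when* events happen, leaving timing to the video-driven temporal encoder.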
C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model
Co-speech gesture generation is crucial for automatic digital avatar
animation. However, existing methods suffer from issues such as unstable
training and temporal inconsistency, particularly in generating high-fidelity
and comprehensive gestures. Additionally, these methods lack effective control
over speaker identity and temporal editing of the generated gestures. Focusing
on capturing temporal latent information and enabling practical control, we
propose a Controllable Co-speech Gesture Generation framework named C2G2.
Specifically, we propose a two-stage temporal dependency enhancement strategy
motivated by latent diffusion models. We further introduce two key features to
C2G2, namely a speaker-specific decoder to generate speaker-related real-length
skeletons and a repainting strategy for flexible gesture generation/editing.
Extensive experiments on benchmark gesture datasets verify the effectiveness of
our proposed C2G2 compared with several state-of-the-art baselines. The
project demo page is available at https://c2g2-gesture.github.io/c2_gesture
Comment: 12 pages, 6 figures, 7 tables
Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Polyphone disambiguation aims to capture accurate pronunciation knowledge
from natural text sequences for reliable Text-to-speech (TTS) systems. However,
previous approaches require substantial annotated training data and additional
efforts from language experts, making it difficult to extend high-quality
neural TTS systems to out-of-domain daily conversations and countless languages
worldwide. This paper tackles the polyphone disambiguation problem from a
concise and novel perspective: we propose Dict-TTS, a semantic-aware generative
text-to-speech model with an online website dictionary (the existing prior
information in the natural language). Specifically, we design a
semantics-to-pronunciation attention (S2PA) module to match the semantic
patterns between the input text sequence and the prior semantics in the
dictionary and obtain the corresponding pronunciations. The S2PA module can be
easily trained with the end-to-end TTS model without any annotated phoneme
labels. Experimental results in three languages show that our model outperforms
several strong baseline models in terms of pronunciation accuracy and improves
the prosody modeling of TTS systems. Further extensive analyses demonstrate
that each design in Dict-TTS is effective. The code is available at
\url{https://github.com/Zain-Jiang/Dict-TTS}.
Comment: Accepted by NeurIPS 202
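The S2PA module matches the semantics of a polyphonic character's context against the semantic descriptions of its dictionary senses and reads off the matched sense's pronunciation. A minimal illustrative sketch, assuming dot-product attention over per-sense semantic embeddings (the function and variable names are hypothetical, not the paper's implementation):

```python
import numpy as np

def s2pa_lookup(char_emb, sense_embs, sense_prons, temperature=1.0):
    """Sketch of semantics-to-pronunciation attention (S2PA): attend from
    a character's contextual embedding (shape [d]) over the semantic
    embeddings of its dictionary senses (shape [n_senses, d]) and return
    the pronunciation of the best-matching sense plus the attention
    weights. In training, the soft weights let gradients flow without
    annotated phoneme labels."""
    scores = sense_embs @ char_emb / (np.sqrt(char_emb.size) * temperature)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return sense_prons[int(np.argmax(weights))], weights
```

For example, a Chinese polyphone like 行 would carry one dictionary sense per reading; the context embedding decides which sense, and hence which pronunciation, applies.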
Autophagy Inhibitor LRPPRC Suppresses Mitophagy through Interaction with Mitophagy Initiator Parkin
Autophagy plays an important role in tumorigenesis. The mitochondrion-associated protein LRPPRC interacts with MAP1S, which in turn interacts with LC3 and bridges autophagy components with microtubules and mitochondria to affect autophagy flux. Dysfunction of LRPPRC and MAP1S is associated with poor survival of ovarian cancer patients, and elevated levels of LRPPRC predict shorter overall survival in patients with prostate adenocarcinomas or gastric cancer. To understand the role of LRPPRC in tumor development, we previously reported that LRPPRC forms a ternary complex with Beclin 1 and Bcl-2 to inhibit autophagy. Here we further show that LRPPRC maintains the stability of Parkin, which mono-ubiquitinates Bcl-2 and thereby increases Bcl-2 stability to inhibit autophagy. Under mitophagy stress, Parkin translocates to mitochondria, causes rupture of the outer mitochondrial membrane, and binds the exposed LRPPRC. Consequently, LRPPRC and Parkin help mitochondria become engulfed in autophagosomes for degradation. In cells under long-term mitophagy stress, both LRPPRC and Parkin become depleted, coincident with the disappearance of mitochondria and eventual autophagy inactivation due to depletion of ATG5-ATG12 conjugates. Thus, LRPPRC functions as a checkpoint protein that protects mitochondria from autophagic degradation and impacts tumorigenesis.