91 research outputs found

    VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

    Video-to-sound generation aims to produce realistic and natural sound for a given video input. However, previous video-to-sound generation methods can only produce a random or average timbre, with no control over or specialization of the generated sound's timbre, so users cannot always obtain the timbre they want. In this paper, we pose the task of generating sound with a specific timbre given a video input and a reference audio sample. To solve this task, we disentangle each target audio into three components: temporal information, acoustic information, and background information. We encode these components with three encoders: 1) a temporal encoder for temporal information, fed with video frames, since the input video shares the same temporal information as the original audio; 2) an acoustic encoder for timbre information, which takes the original audio as input and discards its temporal information through a temporal-corrupting operation; and 3) a background encoder for the residual background sound, which takes the background part of the original audio as input. To improve the quality and temporal alignment of the generated result, we also adopt a mel discriminator and a temporal discriminator for adversarial training. Experimental results on the VAS dataset demonstrate that our method generates high-quality audio samples that are well synchronized with the events in the video and show high timbre similarity to the reference audio.
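
    The three-encoder disentanglement described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the module names, feature dimensions, and the frame-shuffling stand-in for the temporal-corrupting operation are all assumptions.

```python
# Minimal sketch of a three-encoder disentanglement for timbre-controllable
# video-to-sound generation. Module names, dimensions, and the frame-shuffling
# "temporal-corrupting" step are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Encodes per-frame video features into a temporal event track."""
    def __init__(self, vid_dim=512, hid=256):
        super().__init__()
        self.rnn = nn.GRU(vid_dim, hid, batch_first=True, bidirectional=True)

    def forward(self, video_feats):              # (B, T, vid_dim)
        out, _ = self.rnn(video_feats)           # (B, T, 2*hid)
        return out

class AcousticEncoder(nn.Module):
    """Encodes timbre from the reference mel after corrupting its temporal order."""
    def __init__(self, n_mels=80, hid=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, hid), nn.ReLU(), nn.Linear(hid, hid))

    def forward(self, mel):                      # (B, T, n_mels)
        idx = torch.randperm(mel.size(1), device=mel.device)
        corrupted = mel[:, idx]                  # destroy temporal order, keep timbre statistics
        return self.net(corrupted).mean(dim=1)   # (B, hid) global timbre vector

class BackgroundEncoder(nn.Module):
    """Summarizes the background portion of the reference audio."""
    def __init__(self, n_mels=80, hid=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, hid)

    def forward(self, bg_mel):                   # (B, T, n_mels)
        return self.proj(bg_mel).mean(dim=1)     # (B, hid)

class MelDecoder(nn.Module):
    """Fuses temporal, timbre, and background codes into a mel-spectrogram."""
    def __init__(self, hid=256, n_mels=80):
        super().__init__()
        self.out = nn.Linear(2 * hid + hid + hid, n_mels)

    def forward(self, temporal, timbre, background):
        T = temporal.size(1)
        cond = torch.cat([timbre, background], dim=-1).unsqueeze(1).expand(-1, T, -1)
        return self.out(torch.cat([temporal, cond], dim=-1))   # (B, T, n_mels)
```

    Mean-pooling the corrupted reference mel is one simple way to keep only time-invariant (timbre-like) statistics; the pooled vector is then broadcast across the video's temporal track in the decoder.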

    C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model

    Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from issues such as unstable training and temporal inconsistency, particularly when generating high-fidelity and comprehensive gestures. They also lack effective control over speaker identity and temporal editing of the generated gestures. Focusing on capturing temporal latent information and enabling practical control, we propose a Controllable Co-speech Gesture Generation framework, named C2G2. Specifically, we propose a two-stage temporal dependency enhancement strategy motivated by latent diffusion models. We further introduce two key features to C2G2: a speaker-specific decoder that generates speaker-related, real-length skeletons, and a repainting strategy for flexible gesture generation and editing. Extensive experiments on benchmark gesture datasets verify the effectiveness of the proposed C2G2 against several state-of-the-art baselines. The project demo page is available at https://c2g2-gesture.github.io/c2_gesture. Comment: 12 pages, 6 figures, 7 tables.
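
    A repainting strategy for editing can be illustrated with a RePaint-style sampling loop: at every denoising step the unedited frames are re-noised from the reference motion, and only the masked frames are regenerated. The sketch below assumes a diffusers-style scheduler interface and a `denoiser(x, t)` noise predictor; it is a generic illustration, not the C2G2 implementation.

```python
# RePaint-style editing loop over a latent gesture sequence: frames where
# `mask` is 1 are regenerated, frames where it is 0 are kept from the
# reference motion. Scheduler/denoiser interfaces are assumptions
# (diffusers-style), not the C2G2 code.
import torch

@torch.no_grad()
def repaint_edit(denoiser, scheduler, ref_latents, mask, steps=50):
    """ref_latents: (B, T, D) latent gesture sequence; mask: (B, T, 1) in {0, 1}."""
    x = torch.randn_like(ref_latents)                        # start from pure noise
    for t in scheduler.timesteps[:steps]:
        # Re-noise the known (unedited) frames to the current diffusion level
        noise = torch.randn_like(ref_latents)
        known = scheduler.add_noise(ref_latents, noise, t)
        # Keep the model's estimate only in the masked (edited) region
        x = mask * x + (1 - mask) * known
        eps = denoiser(x, t)                                 # predict noise
        x = scheduler.step(eps, t, x).prev_sample            # one reverse step
    return mask * x + (1 - mask) * ref_latents
```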

    Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

    Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and the countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model that uses an online website dictionary (prior information already existing in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module that matches the semantic patterns of the input text sequence against the prior semantics in the dictionary and retrieves the corresponding pronunciations; the S2PA module can be trained end-to-end with the TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baselines in pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective. The code is available at https://github.com/Zain-Jiang/Dict-TTS. Comment: Accepted by NeurIPS 2022.
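
    A minimal sketch of what a semantics-to-pronunciation attention lookup could look like is given below. The single-head dot-product attention, tensor shapes, and per-character candidate-entry layout are simplifying assumptions rather than the released Dict-TTS code.

```python
# Sketch of a semantics-to-pronunciation attention (S2PA) lookup: each input
# character attends over the semantic embeddings of its candidate dictionary
# entries and mixes the corresponding pronunciation embeddings. Shapes and the
# single-head attention are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class S2PASketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # query from the text encoder state
        self.k = nn.Linear(dim, dim)   # key from dictionary-entry semantics
        self.scale = dim ** -0.5

    def forward(self, text_h, entry_sem, entry_pron):
        """
        text_h:     (B, T, dim)    encoded input characters
        entry_sem:  (B, T, N, dim) semantics of N candidate entries per character
        entry_pron: (B, T, N, dim) pronunciation embeddings of those entries
        """
        q = self.q(text_h).unsqueeze(2)                     # (B, T, 1, dim)
        k = self.k(entry_sem)                               # (B, T, N, dim)
        attn = F.softmax((q * k).sum(-1) * self.scale, -1)  # (B, T, N)
        pron = (attn.unsqueeze(-1) * entry_pron).sum(2)     # (B, T, dim)
        return pron, attn                                   # pronunciation mix + weights
```

    Because the attention weights are produced from semantics alone, such a module can in principle be trained jointly with the rest of the TTS model without phoneme-level labels, which is the property the abstract emphasizes.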

    Autophagy Inhibitor LRPPRC Suppresses Mitophagy through Interaction with Mitophagy Initiator Parkin

    Autophagy plays an important role in tumorigenesis. The mitochondrion-associated protein LRPPRC interacts with MAP1S, which interacts with LC3 and bridges autophagy components with microtubules and mitochondria to affect autophagy flux. Dysfunction of LRPPRC and MAP1S is associated with poor survival of ovarian cancer patients, and elevated levels of LRPPRC predict shorter overall survival in patients with prostate adenocarcinomas or gastric cancer. To understand the role of LRPPRC in tumor development, we previously reported that LRPPRC forms a ternary complex with Beclin 1 and Bcl-2 to inhibit autophagy. Here we further show that LRPPRC maintains the stability of Parkin, which mono-ubiquitinates Bcl-2 to increase Bcl-2 stability and thereby inhibit autophagy. Under mitophagy stress, Parkin translocates to mitochondria, causes rupture of the outer mitochondrial membrane, and binds the exposed LRPPRC. Consequently, LRPPRC and Parkin help mitochondria become engulfed in autophagosomes for degradation. In cells under long-term mitophagy stress, both LRPPRC and Parkin become depleted, coincident with the disappearance of mitochondria and eventual autophagy inactivation due to depletion of ATG5-ATG12 conjugates. LRPPRC thus functions as a checkpoint protein that prevents mitochondria from autophagic degradation and impacts tumorigenesis.