
    AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

    Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another and has demonstrated significant progress to date. Despite this recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech translation (AV-S2ST) model that does not rely on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications, such as dictation or dubbing archival films. To mitigate the scarcity of parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representations, and 2) introduce cross-modal distillation with S2ST models trained on audio-only corpora to further reduce the requirements on visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings, regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io. Comment: Accepted to ACL 2023.
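
    A minimal sketch of what cross-modal distillation of this kind could look like, assuming the frozen audio-only teacher and the audio-visual student both predict per-step distributions over the same discrete unit vocabulary; the function names and tensor sizes below are illustrative, not the paper's implementation.

```python
# Hedged sketch: the audio-visual student matches the unit distributions of a
# frozen audio-only S2ST teacher via KL divergence (a standard distillation
# objective; the paper's exact loss may differ).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student unit distributions.

    Both tensors have shape (batch, time, vocab); the teacher is frozen.
    """
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # batchmean KL, scaled by T^2 as is conventional for distillation
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

# Usage: the teacher consumes audio only; the student additionally sees lip video.
batch, time, vocab = 4, 50, 1000
teacher_logits = torch.randn(batch, time, vocab)  # from the frozen audio-only model
student_logits = torch.randn(batch, time, vocab, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits, temperature=2.0)
loss.backward()
```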

    Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

    Zero-shot text-to-speech aims to synthesize voices for unseen speech prompts. Previous large-scale multispeaker TTS models have successfully achieved this goal with an enrolled recording of under 10 seconds. However, most of them are designed to utilize only short speech prompts, and the limited information in a short prompt significantly hinders fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a generic zero-shot multispeaker TTS model capable of synthesizing speech for unseen speakers with prompts of arbitrary length. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference speeches, and 2) train a prosody language model with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Beyond arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple P-LLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level autoregressive duration model to bring in-context learning capabilities to duration modeling. Experiments demonstrate that our method can not only synthesize identity-preserving speech with a short prompt from an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found at https://mega-tts.github.io/mega2_demo/
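
    One way to read the "arbitrary-source prompts" idea is as a per-step mixture of the next-code probabilities produced by several prosody-LM (P-LLM) passes, each conditioned on a different reference prompt. The sketch below follows that reading; the weighting scheme and function names are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: fuse next-token probabilities from multiple P-LLM passes,
# then sample the next prosody code from the weighted mixture.
import torch

def fuse_prosody_probs(per_prompt_logits: list[torch.Tensor],
                       weights: list[float]) -> torch.Tensor:
    """Weighted mixture of softmax distributions over prosody codes.

    Each tensor in `per_prompt_logits` has shape (vocab,) for the current step.
    """
    assert len(per_prompt_logits) == len(weights)
    w = torch.tensor(weights)
    w = w / w.sum()                                   # normalize mixture weights
    probs = torch.stack([logits.softmax(-1) for logits in per_prompt_logits])
    return (w.unsqueeze(-1) * probs).sum(0)           # (vocab,)

# Usage: two reference prompts, biased toward the first prompt's prosody.
logits_a = torch.randn(256)   # P-LLM step conditioned on prompt A
logits_b = torch.randn(256)   # P-LLM step conditioned on prompt B
mixed = fuse_prosody_probs([logits_a, logits_b], weights=[0.7, 0.3])
next_code = torch.multinomial(mixed, num_samples=1)
```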

    TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

    Direct speech-to-speech translation (S2ST) systems leverage recent progress in speech representation learning: a sequence of discrete representations (units), derived in a self-supervised manner, is predicted by the model and passed to a vocoder for speech synthesis. Such systems still face two challenges: 1) acoustic multimodality: the discrete units derived from speech with the same content can be indeterministic due to acoustic properties (e.g., rhythm, pitch, and energy), which degrades translation accuracy; 2) high latency: current S2ST systems use autoregressive models that predict each unit conditioned on the previously generated sequence, failing to take full advantage of parallelism. In this work, we propose TranSpeech, a speech-to-speech translation model with bilateral perturbation. To alleviate the acoustic multimodality problem, bilateral perturbation consists of style normalization and information enhancement stages that learn only the linguistic information from speech samples and generate more deterministic representations. With reduced multimodality, we take a step further and become the first to establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices and produces high-accuracy results in just a few cycles. Experimental results on three language pairs demonstrate state-of-the-art results of up to 2.5 BLEU points over the best publicly available textless S2ST baseline. Moreover, TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique. Audio samples are available at https://TranSpeech.github.io/
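
    The "repeatedly masks and predicts unit choices" step refers to mask-predict style iterative decoding. A minimal sketch of that general technique follows; `model` is a hypothetical stand-in that maps a partially masked unit sequence to per-position logits, and the re-masking schedule is the common linear-decay choice, not necessarily TranSpeech's.

```python
# Hedged sketch of mask-predict non-autoregressive decoding: start fully
# masked, fill in all positions greedily, then re-mask the least confident
# ones and repeat for a few cycles.
import torch

MASK_ID = 0  # assumed id of the [MASK] unit

@torch.no_grad()
def mask_predict(model, length: int, iterations: int = 4) -> torch.Tensor:
    units = torch.full((length,), MASK_ID, dtype=torch.long)  # fully masked start
    for it in range(iterations):
        logits = model(units)                                 # (length, vocab)
        scores, units = logits.softmax(-1).max(-1)            # greedy fill-in
        # linearly decay the number of re-masked positions per iteration
        n_mask = int(length * (iterations - 1 - it) / iterations)
        if n_mask > 0:
            remask = scores.topk(n_mask, largest=False).indices  # least confident
            units[remask] = MASK_ID
    return units

# Usage with a dummy "model" that returns random logits over 1000 units.
dummy = lambda u: torch.randn(u.numel(), 1000)
print(mask_predict(dummy, length=20))
```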

    Generative Zero-Shot Prompt Learning for Cross-Domain Slot Filling with Inverse Prompting

    Zero-shot cross-domain slot filling aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Existing models either encode slot descriptions and examples or design handcrafted question templates using heuristic rules, and consequently suffer from poor generalization or robustness. In this paper, we propose a generative zero-shot prompt learning framework for cross-domain slot filling that improves both generalization and robustness over previous work. In addition, we introduce a novel inverse prompting strategy to distinguish different slot types and avoid the multiple-prediction problem, as well as an efficient prompt-tuning strategy that boosts performance while training only a small number of prompt parameters. Experiments and analysis demonstrate the effectiveness of our proposed framework, with particularly large improvements (+13.44% F1) on unseen slots. Comment: Accepted by the Findings of ACL 2023.
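
    One plausible reading of the inverse prompting strategy: after a forward prompt extracts a candidate value for each slot type, an inverse prompt asks the model which slot type the value belongs to, and candidates whose predicted type disagrees are dropped. The sketch below follows that reading; the templates and the `generate` callable are hypothetical, not the paper's exact prompts.

```python
# Hedged sketch: filter forward-prompt predictions by checking agreement with
# an inverse prompt, mitigating the problem of one value being assigned to
# multiple slot types.
from typing import Callable, Dict

def inverse_filter(utterance: str,
                   candidates: Dict[str, str],
                   generate: Callable[[str], str]) -> Dict[str, str]:
    """Keep only (slot, value) pairs that the inverse prompt confirms."""
    kept = {}
    for slot, value in candidates.items():
        inverse_prompt = f'In "{utterance}", what slot type does "{value}" fill?'
        predicted_slot = generate(inverse_prompt).strip()
        if predicted_slot == slot:        # both directions agree
            kept[slot] = value
    return kept

# Usage with a toy generator that always answers "artist".
candidates = {"artist": "Taylor Swift", "playlist": "Taylor Swift"}
toy = lambda prompt: "artist"
print(inverse_filter("play Taylor Swift", candidates, toy))
# {'artist': 'Taylor Swift'}  (the duplicate 'playlist' prediction is dropped)
```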

    Tetrahydrofolate Modulates Floral Transition through Epigenetic Silencing.

    Folates, the collective term for tetrahydrofolate (THF) and its derivatives, function as coenzymes in one-carbon transfer reactions and play a central role in the synthesis of nucleotides and amino acids. Dysfunction of cellular folate metabolism leads to serious defects in plant development; however, the molecular mechanisms of folate-mediated cellular modifications and physiological responses in plants remain largely unclear. Here, we report that THF controls flowering time by adjusting DNA methylation-regulated gene expression in Arabidopsis (Arabidopsis thaliana). Wild-type seedlings supplied with THF, as well as the dihydrofolate synthetase folylpoly-Glu synthetase homolog B mutant with high endogenous THF content, exhibited significant up-regulation of the flowering repressor FLOWERING WAGENINGEN and thereby delayed floral transition in a dose-dependent manner. Genome-wide transcript and DNA methylation profiling revealed that THF reduces DNA methylation and thereby alters gene expression activity. Moreover, accompanying the elevated cellular ratio of monoglutamylated to polyglutamylated folates under increased THF levels, the content of S-adenosylhomo-Cys, a competitive inhibitor of methyltransferases, was markedly higher, indicating that enhanced THF accumulation may disturb the cellular homeostasis of the concerted reactions between folate polyglutamylation and folate-dependent DNA methylation. In addition, we found that a loss-of-function mutant of the CG DNA methyltransferase MET1 was much less responsive to THF-associated flowering time alteration. Taken together, our studies reveal a novel regulatory role of THF in epigenetic silencing, which sheds light on the interrelations among folate homeostasis, epigenetic variation, and flowering control in plants.