AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech from one
language into another, and has demonstrated significant progress to date.
Despite this recent success, current S2ST models still suffer from noticeable
degradation in noisy environments and fail to translate visual speech (i.e.,
the movement of lips and teeth). In this work, we present AV-TranSpeech, the
first audio-visual speech-to-speech (AV-S2ST) translation model without relying
on intermediate text. AV-TranSpeech complements the audio stream with visual
information to promote system robustness and opens up a host of practical
applications, such as dictation or dubbing archival films. To mitigate the
scarcity of parallel AV-S2ST data, we 1) explore self-supervised
pre-training with unlabeled audio-visual data to learn contextual
representations, and 2) introduce cross-modal distillation with S2ST models
trained on the audio-only corpus to further reduce the requirements of visual
data. Experimental results on two language pairs demonstrate that AV-TranSpeech
outperforms audio-only models under all settings regardless of the type of
noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation
yields an improvement of 7.6 BLEU on average compared with baselines. Audio
samples are available at https://AV-TranSpeech.github.io. Comment: Accepted to ACL 2023
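The cross-modal distillation step can be sketched as matching the audio-visual student's per-frame discrete-unit distribution to that of an audio-only teacher. This is a minimal sketch under assumptions: the function name, tensor shapes, and the plain KL objective are illustrative, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over unit logits."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over frames, on discrete-unit distributions.
    Shapes are hypothetical: (n_frames, n_units)."""
    t = temperature
    p = softmax(teacher_logits / t)  # audio-only teacher distribution
    q = softmax(student_logits / t)  # audio-visual student distribution
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean()) * t * t

# toy example: 2 frames, 8 discrete units (stand-in logits, not real model output)
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 8))
student = rng.normal(size=(2, 8))
loss = distillation_loss(student, teacher)
```

The teacher trained on the larger audio-only corpus supplies soft targets, so the student needs less parallel visual data, which is the point of the distillation step described above.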
Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts
Zero-shot text-to-speech aims at synthesizing voices with unseen speech
prompts. Previous large-scale multispeaker TTS models have successfully
achieved this goal with an enrolled recording of under 10 seconds. However, most
of them are designed to utilize only short speech prompts. The limited
information in short speech prompts significantly hinders the performance of
fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a
generic zero-shot multispeaker TTS model that is capable of synthesizing speech
for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a
multi-reference timbre encoder to extract timbre information from multiple
reference speeches and 2) train a prosody language model with arbitrary-length
speech prompts. With these designs, our model is suitable for prompts of
different lengths, which extends the upper bound of speech quality for
zero-shot text-to-speech. Besides arbitrary-length prompts, we introduce
arbitrary-source prompts, which leverage the probabilities derived from
multiple P-LLM outputs to produce expressive and controlled prosody.
Furthermore, we propose a phoneme-level auto-regressive duration model to
introduce in-context learning capabilities to duration modeling. Experiments
demonstrate that our method can not only synthesize identity-preserving
speech with a short prompt of an unseen speaker but also achieve improved
performance with longer speech prompts. Audio samples can be found in
https://mega-tts.github.io/mega2_demo/
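The arbitrary-source idea, fusing probabilities from multiple P-LLM outputs, could look roughly like a weighted average of per-step token distributions. Everything below (function name, Dirichlet stand-ins for model outputs, uniform weights) is a hypothetical sketch, not Mega-TTS 2's actual fusion rule.

```python
import numpy as np

def fuse_prosody_probs(prob_list, weights=None):
    """Weighted average of per-step token distributions coming from
    several prompt sources; returns one fused, renormalized distribution."""
    probs = np.stack(prob_list)                   # (n_sources, vocab)
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    fused = np.tensordot(weights, probs, axes=1)  # (vocab,)
    return fused / fused.sum()                    # guard against rounding drift

rng = np.random.default_rng(1)
vocab = 16
# stand-in distributions over prosody codes from three different prompts
sources = [rng.dirichlet(np.ones(vocab)) for _ in range(3)]
fused = fuse_prosody_probs(sources)
next_code = int(fused.argmax())  # greedy pick; sampling would also work
```

Weighting the sources differently would steer the generated prosody toward one reference prompt or another, which matches the "controlled prosody" framing in the abstract.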
Transcriptome analysis revealed the dynamic oil accumulation in Symplocos paniculata fruit
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
Direct speech-to-speech translation (S2ST) systems leverage recent progress
in speech representation learning: a sequence of discrete representations
(units), derived in a self-supervised manner, is predicted by the model and
passed to a vocoder for speech synthesis. These systems still face the
following challenges: 1) acoustic multimodality: discrete units derived from
speech with the same content can be indeterministic due to acoustic properties
(e.g., rhythm, pitch, and energy), which degrades translation accuracy;
2) high latency: current S2ST systems use autoregressive models that
predict each unit conditioned on the previously generated sequence, failing to
take full advantage of parallelism. In this work, we propose TranSpeech, a
speech-to-speech translation model with bilateral perturbation. To alleviate
the acoustic multimodal problem, we propose bilateral perturbation, which
consists of the style normalization and information enhancement stages, to
learn only the linguistic information from speech samples and generate more
deterministic representations. With reduced multimodality, we step forward and
become the first to establish a non-autoregressive S2ST technique, which
repeatedly masks and predicts unit choices and produces high-accuracy results
in just a few cycles. Experimental results on three language pairs demonstrate
state-of-the-art results, by up to 2.5 BLEU points over the best
publicly available textless S2ST baseline. Moreover, TranSpeech significantly
improves inference latency, achieving a speedup of up to 21.4x over the
autoregressive technique. Audio samples are available at
https://TranSpeech.github.io/
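The non-autoregressive decoding the abstract describes, repeatedly masking and re-predicting unit choices over a few cycles, can be sketched as a mask-predict loop. The toy model below simply prefers a fixed target sequence with random confidences; it stands in for the real unit predictor and is purely illustrative.

```python
import numpy as np

MASK = -1  # sentinel for a masked unit position

def toy_model(units, target):
    """Stand-in predictor: returns a prediction and a confidence per position.
    A real model would condition on the source speech and the current units."""
    rng = np.random.default_rng(0)
    confidence = rng.uniform(0.5, 1.0, size=len(target))
    return np.array(target), confidence

def mask_predict(length, target, n_iter=3):
    """Start fully masked; each cycle, fill every position in parallel,
    then re-mask the least-confident ones. Fewer are re-masked each cycle."""
    units = np.full(length, MASK)
    for t in range(n_iter):
        pred, conf = toy_model(units, target)
        units = pred.copy()
        # linearly shrinking re-mask budget; zero on the final cycle
        n_mask = int(length * (n_iter - 1 - t) / n_iter)
        if n_mask > 0:
            units[np.argsort(conf)[:n_mask]] = MASK
    return units

target = [3, 1, 4, 1, 5, 9, 2, 6]
out = mask_predict(len(target), target)
```

Because every position is predicted in parallel within a cycle, the number of model calls is the small, fixed cycle count rather than the sequence length, which is where the latency gain over autoregressive decoding comes from.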
Generative Zero-Shot Prompt Learning for Cross-Domain Slot Filling with Inverse Prompting
Zero-shot cross-domain slot filling aims to transfer knowledge from the
labeled source domain to the unlabeled target domain. Existing models either
encode slot descriptions and examples or design handcrafted question templates
using heuristic rules, suffering from poor generalization capability or
robustness. In this paper, we propose a generative zero-shot prompt learning
framework for cross-domain slot filling that improves both generalization and
robustness over previous work. In addition, we introduce a novel inverse
prompting strategy to distinguish different slot types and avoid the
multiple-prediction problem, and an efficient prompt-tuning strategy that
boosts performance while training only a small number of prompt parameters.
Experiments and analysis demonstrate the effectiveness of our proposed
framework, with especially large improvements (+13.44%
F1) on unseen slots. Comment: Accepted by the Findings of ACL 2023
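As a rough illustration of the prompting scheme: a forward prompt asks the generator for a slot's values, while the inverse prompt asks which slot type a candidate value belongs to, helping the model tell similar slot types apart. The template wording and names below are assumptions for illustration, not the paper's exact prompts.

```python
def slot_prompt(utterance, slot_type):
    """Forward prompt: ask the generator for the values of a given slot."""
    return f'utterance: "{utterance}" ; the values of slot "{slot_type}" are:'

def inverse_prompt(utterance, value):
    """Inverse prompt: ask which slot type a candidate value belongs to,
    so one value is not emitted under multiple slot types."""
    return f'utterance: "{utterance}" ; the slot type of "{value}" is:'

u = "book a table at nona for 7 pm"
fwd = slot_prompt(u, "restaurant_name")  # generator should produce "nona"
inv = inverse_prompt(u, "nona")          # generator should produce a slot type
```

Pairing the two directions lets the framework cross-check its own outputs: a value generated by the forward prompt can be verified by the inverse prompt before it is accepted.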
Nitric oxide induces cotyledon senescence involving co-operation of the NES1/MAD1 and EIN2-associated ORE1 signalling pathways in Arabidopsis
Tetrahydrofolate Modulates Floral Transition through Epigenetic Silencing
Folates, a term covering tetrahydrofolate (THF) and its derivatives, function as coenzymes in one-carbon transfer reactions and play a central role in the synthesis of nucleotides and amino acids. Dysfunction of cellular folate metabolism leads to serious defects in plant development; however, the molecular mechanisms of folate-mediated cellular modifications and physiological responses in plants remain largely unclear. Here, we report that THF controls flowering time by adjusting DNA methylation-regulated gene expression in Arabidopsis (Arabidopsis thaliana). Wild-type seedlings supplied with THF, as well as the high-endogenous-THF-content mutant dihydrofolate synthetase folylpoly-Glu synthetase homolog B, exhibited significant up-regulation of the flowering repressor FLOWERING WAGENINGEN, thereby delaying floral transition in a dose-dependent manner. Genome-wide transcript and DNA methylation profiling revealed that THF reduces DNA methylation and thereby modulates gene expression. Moreover, accompanying the elevated cellular ratio of monoglutamylated to polyglutamylated folates under increased THF levels, the content of S-adenosylhomo-Cys, a competitive inhibitor of methyltransferases, was markedly higher, indicating that enhanced THF accumulation may disturb the cellular homeostasis of the concerted reactions between folate polyglutamylation and folate-dependent DNA methylation. In addition, we found that a loss-of-function mutant of the CG DNA methyltransferase MET1 was much less responsive to THF-associated flowering time alteration. Taken together, our studies reveal a novel regulatory role of THF in epigenetic silencing, shedding light on the interrelations among folate homeostasis, epigenetic variation, and flowering control in plants.