EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance
Although current neural text-to-speech (TTS) models are able to generate
high-quality speech, intensity controllable emotional TTS is still a
challenging task. Most existing methods need external optimizations for
intensity calculation, leading to suboptimal results or degraded quality. In
this paper, we propose EmoDiff, a diffusion-based TTS model where emotion
intensity can be manipulated by a proposed soft-label guidance technique
derived from classifier guidance. Specifically, instead of being guided with a
one-hot vector for the specified emotion, EmoDiff is guided with a soft label
where the value of the specified emotion and \textit{Neutral} is set to
$\alpha$ and $1-\alpha$ respectively. The $\alpha$ here represents the emotion
intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can
precisely control the emotion intensity while maintaining high voice quality.
Moreover, diverse speech with specified emotion intensity can be generated by
sampling in the reverse denoising process.
Comment: Accepted to ICASSP202
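The soft label described in the EmoDiff abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the specified emotion receives weight alpha and Neutral receives 1 - alpha, with all other emotions at zero. The function names `soft_label` and `guidance_objective`, and the idea of weighting classifier log-probabilities by the label, are hypothetical choices made for this example.

```python
from typing import Dict, List


def soft_label(emotions: List[str], target: str, alpha: float,
               neutral: str = "Neutral") -> Dict[str, float]:
    """Build a soft label: weight alpha on the target emotion and
    1 - alpha on the neutral class (assumed weighting)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if target not in emotions or neutral not in emotions:
        raise ValueError("target and neutral must both be known emotions")
    label = {e: 0.0 for e in emotions}
    # Accumulate so that target == neutral still sums to 1.
    label[neutral] += 1.0 - alpha
    label[target] += alpha
    return label


def guidance_objective(log_probs: Dict[str, float],
                       label: Dict[str, float]) -> float:
    """Hypothetical guidance score: classifier log-probabilities
    weighted by the soft label instead of a one-hot vector."""
    return sum(label[e] * log_probs[e] for e in label)


emotions = ["Neutral", "Happy", "Sad"]
label = soft_label(emotions, target="Happy", alpha=0.3)
print(label)
```

Setting alpha to 1 recovers the one-hot label for the specified emotion, and alpha equal to 0 puts all the weight on Neutral.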
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
The mainstream neural text-to-speech (TTS) pipeline is a cascade system,
including an acoustic model (AM) that predicts acoustic feature from the input
transcript and a vocoder that generates waveform according to the given
acoustic feature. However, the acoustic feature in current TTS systems is
typically mel-spectrogram, which is highly correlated along both time and
frequency axes in a complicated way, making it difficult for the AM
to predict. Although high-fidelity audio can be generated by recent neural
vocoders from ground-truth (GT) mel-spectrogram, the gap between the GT and the
predicted mel-spectrogram from AM degrades the performance of the entire TTS
system. In this work, we propose VQTTS, consisting of an AM txt2vec and a
vocoder vec2wav, which uses self-supervised vector-quantized (VQ) acoustic
feature rather than mel-spectrogram. We redesign both the AM and the vocoder
accordingly. In particular, txt2vec basically becomes a classification model
instead of a traditional regression model while vec2wav uses an additional
feature encoder before HifiGAN generator for smoothing the discontinuous
quantized feature. Our experiments show that vec2wav achieves better
reconstruction performance than HifiGAN when using self-supervised VQ acoustic
feature. Moreover, our entire TTS system VQTTS achieves state-of-the-art
performance in terms of naturalness among all current publicly available TTS
systems.
Comment: This version has been removed by arXiv administrators because the
submitter did not have the authority to assign the license at the time of
submission
VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
Although diffusion models in text-to-speech have become a popular choice due
to their strong generative ability, the intrinsic complexity of sampling from
diffusion models harms their efficiency. Alternatively, we propose VoiceFlow,
an acoustic model that utilizes a rectified flow matching algorithm to achieve
high synthesis quality with a limited number of sampling steps. VoiceFlow
formulates the process of generating mel-spectrograms into an ordinary
differential equation conditional on text inputs, whose vector field is then
estimated. The rectified flow technique then effectively straightens its
sampling trajectory for efficient synthesis. Subjective and objective
evaluations on both single and multi-speaker corpora showed the superior
synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation
studies further verified the validity of the rectified flow technique in
VoiceFlow.
Comment: 4 figures, 5 pages, submitted to ICASSP 202
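To illustrate why a straight sampling trajectory permits few sampling steps, here is a small sketch of fixed-step Euler integration of an ordinary differential equation from t = 0 to t = 1. It is not VoiceFlow itself: the vector field below is a hand-written toy that points straight at a fixed target, standing in for a learned, text-conditioned field. The names `euler_sample` and `make_toy_field` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(x0: Vector,
                 vector_field: Callable[[Vector, float], Vector],
                 num_steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so 1 - t stays positive
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_field(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy field whose trajectories are straight lines ending at `target`.
    On the line x_t = (1 - t) * x0 + t * target, the velocity
    (target - x_t) / (1 - t) equals the constant target - x0."""
    def field(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return field


field = make_toy_field([1.0, -2.0])
print(euler_sample([0.0, 0.0], field, num_steps=1))
print(euler_sample([0.0, 0.0], field, num_steps=4))
```

Because the toy trajectory is exactly straight, one Euler step already lands on the target. A curved trajectory would generally need more steps to reach the same accuracy.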
Research on the Development Path of Rural Industry Revitalization by Non-Legacy of Folklore: Taking Chaoshan Meeting of p county in Linfen City as an Example
With the development of the digital economy, the protection and inheritance of intangible cultural heritage has become an important issue. As an important part of traditional Chinese culture, folk-custom intangible heritage carries rich historical and cultural connotations and national spirit. However, with changing times and the widening gap between urban and rural development, many folk-custom intangible heritage projects face the risk of being lost. As a folk celebration with a profound historical and cultural background, the Chaoshan meeting in p county of Linfen City, Shanxi Province, carries rich local cultural connotations. This paper analyzes the present situation of the Chaoshan meeting in Linfen City, Shanxi Province, discusses its development dilemmas at the micro, middle, and macro levels, and puts forward a targeted development path to promote the revitalization of rural industries and the in-depth development of intangible heritage protection and inheritance.
Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations
Expressive text-to-speech (TTS) aims to synthesize speech with human-like
tones, moods, or even artistic attributes. Recent advancements in expressive
TTS empower users with the ability to directly control synthesis style through
natural language prompts. However, these methods often require excessive
training with a significant amount of style-annotated data, which can be
challenging to acquire. Moreover, they may have limited adaptability due to
fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a
controllable expressive TTS model with minimal human annotations. Our approach
utilizes a large language model (LLM) to transform expressive TTS into a style
retrieval task. The LLM selects the best-matching style references from
annotated utterances based on external style prompts, which can be raw input
text or natural language style descriptions. The selected reference guides the
TTS pipeline to synthesize speech with the intended style. This innovative
approach provides flexible, versatile, and precise style control with minimal
human workload. Experiments on a Mandarin storytelling corpus demonstrate
FS-TTS's proficiency in leveraging LLM's semantic inference ability to retrieve
desired styles from either input text or user-defined descriptions. This
results in synthetic speech that is closely aligned with the specified
styles.
Comment: 5 pages, 3 figures, submitted to ICASSP 202
Acoustic BPE for Speech Generation with Discrete Tokens
Discrete audio tokens derived from self-supervised learning models have
gained widespread usage in speech generation. However, the current practice of
directly utilizing audio tokens poses challenges for sequence modeling due to
the length of the token sequence. Additionally, this approach places the burden
on the model to establish correlations between tokens, further complicating the
modeling process. To address this issue, we propose acoustic BPE which encodes
frequent audio token patterns by utilizing byte-pair encoding. Acoustic BPE
effectively reduces the sequence length and leverages the prior morphological
information present in token sequence, which alleviates the modeling challenges
of token correlation. Through comprehensive investigations on a speech language
model trained with acoustic BPE, we confirm the notable advantages it offers,
including faster inference and improved syntax capturing capabilities. In
addition, we propose a novel rescore method to select the optimal synthetic
speech among multiple candidates generated by a rich-diversity TTS system.
Experiments show that rescore selection aligns closely with human preference,
which highlights acoustic BPE's potential for other speech generation tasks.
Comment: 5 pages, 2 figures; accepted to ICASSP 202
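The idea of applying byte-pair encoding to discrete audio tokens can be sketched as follows. This is a generic BPE training loop written for illustration, not the authors' code. It assumes audio tokens are integer ids, and the function names `merge_pair` and `train_bpe`, the `first_new_id` parameter, and the tie-breaking rule are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[int, int]


def merge_pair(seq: List[int], pair: Pair, new_token: int) -> List[int]:
    """Replace each non-overlapping occurrence of `pair` with `new_token`."""
    out: List[int] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sequences: List[List[int]], num_merges: int,
              first_new_id: int):
    """Greedily merge the most frequent adjacent token pair, up to
    `num_merges` times. Returns the merge rules and encoded sequences."""
    seqs = [list(s) for s in sequences]
    merges: List[Tuple[Pair, int]] = []
    for _ in range(num_merges):
        counts: Counter = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        # Highest count wins; ties are broken by the larger pair (arbitrary).
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < 2:
            break  # no pair repeats, so merging cannot shorten anything
        new_id = first_new_id + len(merges)
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


merges, encoded = train_bpe([[5, 7, 5, 7, 3], [5, 7, 2]],
                            num_merges=2, first_new_id=100)
print(merges)   # learned merge rules
print(encoded)  # shorter sequences over the enlarged vocabulary
```

In this toy run the pair (5, 7) occurs three times and is merged into a new token, so the first sequence shrinks from five tokens to three.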
Combined amino acids modulation with H2O2 stress for glutathione overproduction in Candida utilis
Strategies of amino acid addition coupled with H2O2 stress were developed for glutathione (GSH) overproduction in high cell density (HCD) cultivation of Candida utilis. Because glycine both promotes cell growth and enhances GSH production, two precursor amino acid modulations were conducted: feeding glycine at 4 mmol/l/h during the exponential phase, and adding precursor amino acids (glutamic acid 42 mmol/l, glycine 40 mmol/l, and cysteine 36 mmol/l) at the stationary phase. As a result, cell density reached 114.8 g/l at 45 h and a glutathione yield of 2136 mg/l was achieved at 60 h, which were 12.5 and 90.2% higher than the control, respectively. Furthermore, amino acid modulation was combined with H2O2 additions (24 mmol/l at 21 h, 26 mmol/l at 29 h, 28 mmol/l at 37 h and 30 mmol/l at 45 h) to maximize glutathione production. The final glutathione yield reached 2448 mg/l after 60 h of cultivation, suggesting that the strategies developed are feasible for GSH overproduction.
Keywords: Amino acids, glutathione (GSH), high cell density (HCD) cultivation, Candida utilis, H2O2 stresses
African Journal of Biotechnology Vol. 9(33), pp. 5399-5406, 16 August, 201
The Evaluation of Toxicity Induced by Psoraleae Fructus in Rats Using Untargeted Metabonomic Method Based on UPLC-Q-TOF/MS
Psoraleae Fructus is the dried mature fruit of the leguminous plant Psoralea corylifolia L., which is used for warming the kidney and enhancing yang, warming the spleen, and other effects. However, large doses can cause liver and kidney toxicity. Therefore, it is necessary to evaluate the toxicity of Psoraleae Fructus systematically. Although traditional biochemical indicators and pathological tests have been used to evaluate drug safety, these methods lack sensitivity and specificity, so a fast and sensitive analytical method is urgently needed. In this study, an ultraperformance liquid chromatography coupled with quadrupole time-of-flight mass spectrometry (UPLC-Q-TOF/MS) method was used to analyze the metabolic profiles of rat plasma. The changes of metabolites in plasma samples were detected by partial least squares-discriminant analysis (PLS-DA). Compared with the control group, after 7 days of administration, the pathological sections showed liver and kidney toxicity, and the metabolic trend had changed. Finally, 13 potential biomarkers related to the toxicity of Psoraleae Fructus were screened. The metabolic pathways involved included glycerophospholipid metabolism, amino acid metabolism, and energy metabolism. The discovery of these biomarkers lays a foundation for better explaining the hepatotoxicity and nephrotoxicity of Psoraleae Fructus and supports its safety evaluation.
Cloning and expression of pineapple sucrosephosphate synthase gene during fruit development
A 1132-base-pair (bp) polymerase chain reaction product of sucrose-phosphate synthase (SPS) (EC 2.4.1.14) from pineapple (Ananas comosus cv. Comte de Paris) fruit was cloned and designated Ac-SPS1. The sequence encodes a putative 377-amino-acid protein containing two conserved serine features found in other plant SPS genes: a 14-3-3 protein binding domain and an osmotic stress activation site, which can be activated by phosphorylation and dephosphorylation. The neighbour-joining tree revealed that Ac-SPS1 belongs to the first class of sucrose-phosphate synthase genes. Ac-SPS1 expression was low early in fruit growth, increased from 20 days after anthesis, fell gradually at 40 days, and peaked around 70 days. SPS activity and sucrose content reached their maxima 80 days after anthesis. Sucrose accumulation was thus correlated with SPS activity and mRNA content, and it peaked 10 d after SPS mRNA and activity had reached their maxima. These results indicate that the Ac-SPS1 gene plays a key role in sucrose accumulation during pineapple fruit development, and that transcriptional activation with increased Ac-SPS1 expression might be an important regulatory event for sugar during pineapple fruit maturation.
Key words: Pineapple fruit, sucrose phosphate synthase, gene cloning, expression
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding
The utilization of discrete speech tokens, divided into semantic tokens and
acoustic tokens, has been proven superior to traditional acoustic feature
mel-spectrograms in terms of naturalness and robustness for text-to-speech
(TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow
zero-shot speaker adaptation through auto-regressive (AR) continuation of
acoustic tokens extracted from a short speech prompt. However, these AR models
are restricted to generating speech only in a left-to-right direction, making
them unsuitable for speech editing where both preceding and following contexts
are provided. Furthermore, these models rely on acoustic tokens, which have
audio quality limitations imposed by the performance of audio codec models. In
this study, we propose a unified context-aware TTS framework called UniCATS,
which is capable of both speech continuation and editing. UniCATS comprises two
components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav.
CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the
input text, enabling it to incorporate the semantic context and maintain
seamless concatenation with the surrounding context. Following that,
CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into
waveforms, taking into consideration the acoustic context. Our experimental
results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms
of speech resynthesis from semantic tokens. Moreover, we show that UniCATS
achieves state-of-the-art performance in both speech continuation and editing
- …