
    EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

    Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity-controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the values of the specified emotion and Neutral are set to α and 1−α, respectively. Here α represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with a specified emotion intensity can be generated by sampling in the reverse denoising process. Comment: Accepted to ICASSP202
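The soft label described in this abstract can be sketched in a few lines. This is only an illustration of how such a label might be built, not the authors' implementation: the function name `soft_label`, the emotion list, and the handling of the case where the target is itself Neutral are all assumptions.

```python
from typing import List, Sequence


def soft_label(emotions: Sequence[str], target: str, alpha: float,
               neutral: str = "Neutral") -> List[float]:
    """Build a soft label: alpha on the target emotion, 1 - alpha on Neutral.

    Hypothetical helper for illustration; names are not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if target not in emotions or neutral not in emotions:
        raise ValueError("target and neutral must both be known emotions")
    label = [0.0] * len(emotions)
    # Add rather than assign so that target == neutral yields a weight of 1.
    label[emotions.index(neutral)] += 1.0 - alpha
    label[emotions.index(target)] += alpha
    return label


emotions = ["Neutral", "Happy", "Sad"]
weak_happy = soft_label(emotions, "Happy", 0.3)    # mostly Neutral
strong_happy = soft_label(emotions, "Happy", 1.0)  # reduces to a one-hot vector
```

With α = 1 the label is the usual one-hot vector, and with α = 0 it is pure Neutral, which matches the abstract's description of α as an intensity between 0 and 1.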

    VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

    The mainstream neural text-to-speech (TTS) pipeline is a cascade system, including an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates the waveform from the given acoustic features. However, the acoustic feature in current TTS systems is typically the mel-spectrogram, which is highly correlated along both the time and frequency axes in a complicated way, making it very difficult for the AM to predict. Although recent neural vocoders can generate high-fidelity audio from the ground-truth (GT) mel-spectrogram, the gap between the GT and the mel-spectrogram predicted by the AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM txt2vec and a vocoder vec2wav, which uses a self-supervised vector-quantized (VQ) acoustic feature rather than the mel-spectrogram. We redesign both the AM and the vocoder accordingly. In particular, txt2vec essentially becomes a classification model instead of a traditional regression model, while vec2wav uses an additional feature encoder before the HifiGAN generator to smooth the discontinuous quantized feature. Our experiments show that vec2wav achieves better reconstruction performance than HifiGAN when using the self-supervised VQ acoustic feature. Moreover, our entire TTS system VQTTS achieves state-of-the-art naturalness among all current publicly available TTS systems. Comment: This version has been removed by arXiv administrators because the submitter did not have the authority to assign the license at the time of submission

    VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

    Although diffusion models have become a popular choice in text-to-speech due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. As an alternative, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms as an ordinary differential equation conditioned on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single- and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to its diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow. Comment: 4 figures, 5 pages, submitted to ICASSP 202
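To illustrate why a straightened trajectory permits few sampling steps, here is a toy sketch of fixed-step Euler integration of an ODE from t = 0 to t = 1. It is not VoiceFlow's sampler: in a real system the vector field would be a trained network conditioned on text, whereas `straight_field` below is a hypothetical constant field standing in for a perfectly straight path, and all names are assumptions.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(vector_field: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a learned field: a constant velocity pointing from the
# starting noise sample to the target, i.e. a perfectly straight trajectory.
noise = [0.0, 1.0]
target = [2.0, -1.0]


def straight_field(x: Vector, t: float) -> Vector:
    return [b - a for a, b in zip(noise, target)]


one_step = euler_sample(straight_field, noise, num_steps=1)
four_steps = euler_sample(straight_field, noise, num_steps=4)
```

When the path is exactly straight, a single Euler step already lands on the target, so extra steps add cost without changing the result. A curved trajectory would instead accumulate discretization error at low step counts.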

    Research on the Development Path of Rural Industry Revitalization by Non-Legacy of Folklore: Taking Chaoshan Meeting of p county in Linfen City as an Example

    With the development of the digital economy, the protection and inheritance of intangible cultural heritage has become an important issue to be solved. As an important part of traditional Chinese culture, folk intangible cultural heritage carries rich historical and cultural connotations and national spirit. However, with the changing times and the widening gap between urban and rural development, many folk heritage projects face the risk of being lost. As a folk celebration with a profound historical and cultural background, the Chaoshan Meeting in P County of Linfen City, Shanxi Province, carries rich local cultural connotations. This paper analyzes the present situation of the Chaoshan Meeting in Linfen City, Shanxi Province, discusses its development dilemmas at the micro, meso, and macro levels, and puts forward a targeted development path to promote the revitalization of rural industries and the further development of heritage protection and inheritance work.

    Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

    Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes. Recent advancements in expressive TTS let users directly control synthesis style through natural language prompts. However, these methods often require extensive training on a significant amount of style-annotated data, which can be challenging to acquire. Moreover, they may have limited adaptability due to fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations. Our approach utilizes a large language model (LLM) to transform expressive TTS into a style retrieval task. The LLM selects the best-matching style references from annotated utterances based on external style prompts, which can be raw input text or natural language style descriptions. The selected reference guides the TTS pipeline to synthesize speech with the intended style. This approach provides flexible, versatile, and precise style control with minimal human workload. Experiments on a Mandarin storytelling corpus demonstrate FS-TTS's proficiency in leveraging the LLM's semantic inference ability to retrieve desired styles from either input text or user-defined descriptions. This results in synthetic speech that is closely aligned with the specified styles. Comment: 5 pages, 3 figures, submitted to ICASSP 202

    Acoustic BPE for Speech Generation with Discrete Tokens

    Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, the current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling process. To address this issue, we propose acoustic BPE, which encodes frequent audio token patterns by utilizing byte-pair encoding. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in the token sequence, which alleviates the modeling challenges of token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax-capturing capabilities. In addition, we propose a novel rescore method to select the optimal synthetic speech among multiple candidates generated by a highly diverse TTS system. Experiments show that rescore selection aligns closely with human preference, which highlights acoustic BPE's potential for other speech generation tasks. Comment: 5 pages, 2 figures; accepted to ICASSP 202
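As a rough illustration of byte-pair encoding applied to integer audio tokens, the sketch below repeatedly merges the most frequent adjacent token pair into a new token id. It is generic BPE under stated assumptions, not the paper's pipeline: the function names, the toy sequences, the stopping rule (stop once no pair occurs at least twice), and the choice of 100 as the first new id are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def merge_pair(seq: Sequence[int], pair: Pair, new_token: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out: List[int] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences: Sequence[Sequence[int]], num_merges: int,
              first_new_id: int) -> Tuple[List[Tuple[Pair, int]], List[List[int]]]:
    """Learn up to `num_merges` merges; return the merge table and encoded data."""
    seqs = [list(s) for s in sequences]
    merges: List[Tuple[Pair, int]] = []
    next_id = first_new_id
    for _ in range(num_merges):
        counts: Counter = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))  # count adjacent token pairs
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # nothing repeats, so merging would not shorten anything
            break
        merges.append((pair, next_id))
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs


# Toy token sequences standing in for quantized audio frames.
merges, encoded = learn_bpe([[1, 2, 1, 2, 3], [1, 2, 3]],
                            num_merges=2, first_new_id=100)
```

In this toy run the pair (1, 2) is merged first and then (100, 3), shrinking the two sequences from five and three tokens to two and one, which is the sequence-length reduction the abstract refers to.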

    Combined amino acids modulation with H2O2 stress for glutathione overproduction in Candida utilis

    Strategies of amino acid addition coupled with H2O2 stress were developed for glutathione (GSH) overproduction in high cell density (HCD) cultivation of Candida utilis. Because glycine promotes both cell growth and GSH production, precursor amino acid modulation was conducted by feeding glycine at 4 mmol/l/h during the exponential phase and adding precursor amino acids (glutamic acid 42 mmol/l, glycine 40 mmol/l, and cysteine 36 mmol/l) at the stationary phase. As a result, cell density reached 114.8 g/l at 45 h and a glutathione yield of 2136 mg/l was achieved at 60 h, which were 12.5 and 90.2% higher than the control, respectively. Furthermore, amino acid modulation was combined with H2O2 additions (24 mmol/l at 21 h, 26 mmol/l at 29 h, 28 mmol/l at 37 h and 30 mmol/l at 45 h) to maximize glutathione production. The final glutathione yield reached 2448 mg/l after 60 h of cultivation, suggesting that the strategies developed are feasible for GSH overproduction. Keywords: Amino acids, glutathione (GSH), high cell density (HCD) cultivation, Candida utilis, H2O2 stresses. African Journal of Biotechnology Vol. 9(33), pp. 5399-5406, 16 August, 201

    The Evaluation of Toxicity Induced by Psoraleae Fructus in Rats Using Untargeted Metabonomic Method Based on UPLC-Q-TOF/MS

    Psoraleae Fructus is the dry and mature fruit of the leguminous plant Psoralea corylifolia L., with the activity of warming kidney and enhancing yang, warming spleen, and other effects. However, large doses can cause liver and kidney toxicity. Therefore, it is necessary to evaluate the toxicity of Psoraleae Fructus systematically. Although traditional biochemical indicators and pathological tests have been used to evaluate drug safety, these methods lack sensitivity and specificity, so a fast and sensitive analytical method is urgently needed. In this study, an ultraperformance liquid chromatography coupled with quadrupole time-of-flight mass spectrometry (UPLC-Q-TOF/MS) method was used to analyze the metabolic profiles of rat plasma. The changes of metabolites in plasma samples were detected by partial least squares-discriminant analysis (PLS-DA). Compared with the control group, after 7 days of administration, the pathological sections showed liver and kidney toxicity, and the metabolic trend had changed. Finally, 13 potential biomarkers related to the toxicity of Psoraleae Fructus were screened. The metabolic pathways involved were glycerol phospholipid metabolism, amino acid metabolism, energy metabolism, and so forth. The discovery of these biomarkers lays a foundation for better explaining the hepatotoxicity and nephrotoxicity of Psoraleae Fructus and supports its safety evaluation.

    Cloning and expression of pineapple sucrosephosphate synthase gene during fruit development

    A 1132-base pair (bp) polymerase chain reaction product of sucrose-phosphate synthase (SPS) (EC 2.3.1.14) from pineapple (Ananas comosus cv. Comte de paris) fruit was cloned and named Ac-SPS1. The sequence encodes a putative 377-amino-acid protein containing two conserved serine features that have been found in other plant SPS genes: a 14-3-3 protein special binding domain and an osmotic-stress activation site, which can be activated by phosphorylation and dephosphorylation. The neighbour-joining tree revealed that Ac-SPS1 belongs to the first kind of sucrose phosphate synthase gene. Ac-SPS1 expression was low in the early period of fruit growth, increased from 20 days after anthesis, fell gradually at 40 days, and reached its peak around 70 days. SPS activity and sucrose content reached their maximum 80 days after anthesis. The accumulation of sucrose was thus correlated with SPS activity and mRNA content, and it was greatest 10 d after SPS mRNA and activity had reached their maxima. These results indicate that the Ac-SPS1 gene plays a key role in sucrose accumulation during pineapple fruit development, and that transcriptional activation with increased Ac-SPS1 expression might be an important regulatory event for sugar during pineapple fruit maturation. Key words: Pineapple fruit, sucrose phosphate synthase, gene cloning, expression

    UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

    The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional mel-spectrogram acoustic features in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generating speech only in a left-to-right direction, making them unsuitable for speech editing where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, which have audio quality limitations imposed by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. Following that, CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into waveforms, taking into consideration the acoustic context. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing.