58 research outputs found
Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations
This paper aims to enhance low-resource TTS by reducing training data
requirements using compact speech representations. A Multi-Stage Multi-Codebook
(MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to
waveforms. Subsequently, we train the multi-stage predictor to predict MSMCRs
from the text for TTS synthesis. Moreover, we optimize the training strategy by
leveraging more audio to learn MSMCRs better for low-resource languages. It
selects audio from other languages using a speaker-similarity metric to augment
the training set, and applies transfer learning to improve training quality. In
MOS tests, the proposed system significantly outperforms FastSpeech and VITS in
standard and low-resource scenarios, showing lower data requirements. The
proposed training strategy effectively enhances MSMCRs on waveform
reconstruction. It further improves TTS performance, winning 77% of the votes in
the preference test for low-resource TTS with only 15 minutes of paired data.
Comment: Submitted to ICASSP 202
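The cross-lingual data-selection step described above can be sketched as a nearest-neighbour search over speaker embeddings: rank candidate utterances from other languages by similarity to the target speaker and keep the closest ones. A minimal sketch with cosine similarity; the toy embeddings and the `top_k` cutoff are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_augmentation_audio(target_emb, candidate_embs, top_k=2):
    """Pick the top-k candidate utterances (e.g. from other languages)
    whose speaker embeddings are closest to the target speaker."""
    sims = [cosine_similarity(target_emb, e) for e in candidate_embs]
    order = np.argsort(sims)[::-1]          # most similar first
    return [int(i) for i in order[:top_k]]

# Toy example: 4 candidate utterances with 3-dim "embeddings".
target = np.array([1.0, 0.0, 0.0])
cands = [np.array([0.9, 0.1, 0.0]),    # very similar speaker
         np.array([0.0, 1.0, 0.0]),    # orthogonal
         np.array([0.7, 0.7, 0.0]),
         np.array([-1.0, 0.0, 0.0])]   # opposite direction
print(select_augmentation_audio(target, cands, top_k=2))  # [0, 2]
```

In practice the embeddings would come from a pre-trained speaker encoder; the selected utterances are then added to the training set before transfer learning.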
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance
neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE)
based feature analyzer is used to encode Mel spectrograms of speech training
data by down-sampling progressively in multiple stages into MSMC
Representations (MSMCRs) with different time resolutions, and quantizing them
with multiple VQ codebooks, respectively. Multi-stage predictors are trained to
map the input text sequence to MSMCRs progressively by minimizing a combined
loss of the reconstruction Mean Square Error (MSE) and "triplet loss". In
synthesis, the neural vocoder converts the predicted MSMCRs into final speech
waveforms. The proposed approach is trained and tested with an English TTS
database of 16 hours by a female speaker. The proposed TTS achieves an MOS
score of 4.41, which outperforms the baseline with an MOS of 3.62. Compact
versions of the proposed TTS with far fewer parameters can still preserve high
MOS scores. Ablation studies show that both multiple stages and multiple
codebooks are effective for achieving high TTS performance.
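The encoding scheme above can be sketched as residual quantization with several codebooks per stage, applied to progressively down-sampled frames. This is a toy numpy illustration of the idea, not the paper's VQ-VAE: codebook sizes, the factor-2 down-sampling, and random initialization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(frames: np.ndarray, codebooks: list) -> np.ndarray:
    """Quantize frames with several codebooks: each codebook encodes
    the residual left by the previous one (multi-codebook)."""
    residual, recon = frames, np.zeros_like(frames)
    for cb in codebooks:                        # cb: (K, D) codewords
        d = ((residual[:, None, :] - cb[None]) ** 2).sum(-1)  # (N, K)
        recon = recon + cb[d.argmin(1)]         # nearest codeword per frame
        residual = frames - recon
    return recon

def msmc_encode(mel: np.ndarray, stages: list) -> list:
    """Multi-stage encoding: quantize, then down-sample in time and
    quantize again at the coarser resolution (multi-stage)."""
    reps, x = [], mel
    for codebooks in stages:
        reps.append(quantize(x, codebooks))
        x = x[::2]                              # halve the time resolution
    return reps

# Toy run: 8 frames of a 4-dim "Mel" feature, 2 stages x 2 codebooks.
mel = rng.normal(size=(8, 4))
stages = [[rng.normal(size=(6, 4)) for _ in range(2)] for _ in range(2)]
reps = msmc_encode(mel, stages)
print([r.shape for r in reps])  # [(8, 4), (4, 4)]
```

In the actual system the predictor is trained to produce these multi-resolution representations from text, and a neural vocoder maps them back to waveforms.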
QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve
TTS quality with lower supervised data requirements via Vector-Quantized
Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more
unlabeled speech audio. This framework comprises two VQ-S3R learners: first,
the principal learner aims to provide a generative Multi-Stage Multi-Codebook
(MSMC) VQ-S3R via the MSMC-VQ-GAN combined with the contrastive S3RL, while
decoding it back to high-quality audio; then, the associate learner further
abstracts the MSMC representation into a highly-compact VQ representation
through a VQ-VAE. These two generative VQ-S3R learners provide effective
speech representations and pre-trained models for TTS, significantly improving
synthesis quality while requiring less supervised data. QS-TTS is
evaluated comprehensively under various scenarios via subjective and objective
tests. The results demonstrate the superior
performance of QS-TTS, which achieves the highest MOS among supervised and
semi-supervised baseline TTS approaches, especially in low-resource scenarios.
Moreover, comparing various speech representations and transfer-learning
methods in TTS further validates the notable improvement the proposed
VQ-S3RL brings to TTS, yielding the best audio-quality and intelligibility
metrics. The slower decay in the synthesis quality of QS-TTS as supervised
data decreases further highlights its lower requirement for supervised data,
indicating its great potential in low-resource scenarios.
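The associate learner's compaction step is, at its core, a vector-quantization bottleneck: each continuous vector is replaced by its nearest codeword, so a long representation is summarized by a short sequence of discrete indices. A minimal sketch with a hand-picked toy codebook (assumed for illustration; the paper's VQ-VAE learns its codebook jointly with an encoder and decoder):

```python
import numpy as np

def vq_bottleneck(z: np.ndarray, codebook: np.ndarray):
    """Replace each vector in z by its nearest codebook entry,
    returning the discrete indices and the quantized vectors."""
    d = ((z[:, None, :] - codebook[None]) ** 2).sum(-1)  # (N, K) distances
    idx = d.argmin(1)                                    # nearest codeword
    return idx, codebook[idx]

# Toy codebook with two 2-dim codewords.
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0]])
z = np.array([[0.1, -0.1],    # close to codeword 0
              [0.9, 1.2]])    # close to codeword 1
idx, zq = vq_bottleneck(z, codebook)
print(idx.tolist())  # [0, 1]
```

The discrete indices are what make the representation highly compact: downstream TTS training only needs to predict codebook entries rather than raw continuous features.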
Tuning the Magnetic Ordering Temperature of Hexagonal Ferrites by Structural Distortion Control
To tune the magnetic properties of hexagonal ferrites, a family of
magnetoelectric multiferroic materials, by atomic-scale structural engineering,
we studied the effect of structural distortion on the magnetic ordering
temperature (TN). Using symmetry analysis, we show that, unlike most
antiferromagnetic rare-earth transition-metal perovskites, a larger structural
distortion leads to a higher TN in hexagonal ferrites and manganites, because
the K3 structural distortion induces three-dimensional magnetic ordering,
which is forbidden in the undistorted structure by symmetry. We also revealed a
near-linear relation between TN and the tolerance factor and a power-law
relation between TN and the K3 distortion amplitude. Following the analysis, a
record-high TN (185 K) among hexagonal ferrites was predicted in hexagonal
ScFeO3 and experimentally verified in epitaxially stabilized films. These
results add to the paradigm of spin-lattice coupling in antiferromagnetic
oxides and suggest further tunability of hexagonal ferrites if greater lattice
distortion can be achieved.
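The two empirical relations reported above can be written compactly; the coefficients $a$, $b$, and the exponent $\alpha$ are fit parameters not given in the abstract:

```latex
% near-linear relation between the ordering temperature and the tolerance factor t
T_N \approx a\,t + b
% power-law relation between T_N and the K3 distortion amplitude
T_N \propto Q_{K_3}^{\alpha}
```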
- …