58 research outputs found
Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations
This paper aims to enhance low-resource TTS by reducing training data
requirements using compact speech representations. A Multi-Stage Multi-Codebook
(MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to
waveforms. Subsequently, we train the multi-stage predictor to predict MSMCRs
from the text for TTS synthesis. Moreover, we optimize the training strategy by
leveraging more audio to learn MSMCRs better for low-resource languages. It
selects audio from other languages using a speaker-similarity metric to augment
the training set, and applies transfer learning to improve training quality. In
MOS tests, the proposed system significantly outperforms FastSpeech and VITS in
standard and low-resource scenarios, showing lower data requirements. The
proposed training strategy effectively enhances MSMCRs on waveform
reconstruction. It further improves TTS performance, winning 77% of the votes in
the preference test for low-resource TTS with only 15 minutes of paired data.
Comment: Submitted to ICASSP 202
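The cross-lingual data-selection step described above can be sketched as a nearest-neighbour search over speaker embeddings: rank candidate utterances from other languages by similarity to the target speaker and keep the closest ones. A minimal sketch with cosine similarity; the toy embeddings and the `top_k` cutoff are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_augmentation_audio(target_emb, candidate_embs, top_k=2):
    """Pick the top-k candidate utterances (e.g. from other languages)
    whose speaker embeddings are closest to the target speaker."""
    sims = [cosine_similarity(target_emb, e) for e in candidate_embs]
    order = np.argsort(sims)[::-1]          # most similar first
    return [int(i) for i in order[:top_k]]

# Toy example: 4 candidate utterances with 3-dim "embeddings".
target = np.array([1.0, 0.0, 0.0])
cands = [np.array([0.9, 0.1, 0.0]),    # very similar speaker
         np.array([0.0, 1.0, 0.0]),    # orthogonal
         np.array([0.7, 0.7, 0.0]),
         np.array([-1.0, 0.0, 0.0])]   # opposite direction
print(select_augmentation_audio(target, cands, top_k=2))  # [0, 2]
```

In practice the embeddings would come from a pre-trained speaker encoder; the selected utterances are then added to the training set before transfer learning.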
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance
neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE)
based feature analyzer is used to encode Mel spectrograms of speech training
data by down-sampling progressively in multiple stages into MSMC
Representations (MSMCRs) with different time resolutions, and quantizing them
with multiple VQ codebooks, respectively. Multi-stage predictors are trained to
map the input text sequence to MSMCRs progressively by minimizing a combined
loss of the reconstruction Mean Square Error (MSE) and "triplet loss". In
synthesis, the neural vocoder converts the predicted MSMCRs into final speech
waveforms. The proposed approach is trained and tested with an English TTS
database of 16 hours by a female speaker. The proposed TTS achieves an MOS
score of 4.41, which outperforms the baseline with an MOS of 3.62. Compact
versions of the proposed TTS with far fewer parameters can still preserve high
MOS scores. Ablation studies show that both multiple stages and multiple
codebooks are effective for achieving high TTS performance.
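The encoding scheme above can be sketched as residual quantization with several codebooks per stage, applied to progressively down-sampled frames. This is a toy numpy illustration of the idea, not the paper's VQ-VAE: codebook sizes, the factor-2 down-sampling, and random initialization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(frames: np.ndarray, codebooks: list) -> np.ndarray:
    """Quantize frames with several codebooks: each codebook encodes
    the residual left by the previous one (multi-codebook)."""
    residual, recon = frames, np.zeros_like(frames)
    for cb in codebooks:                        # cb: (K, D) codewords
        d = ((residual[:, None, :] - cb[None]) ** 2).sum(-1)  # (N, K)
        recon = recon + cb[d.argmin(1)]         # nearest codeword per frame
        residual = frames - recon
    return recon

def msmc_encode(mel: np.ndarray, stages: list) -> list:
    """Multi-stage encoding: quantize, then down-sample in time and
    quantize again at the coarser resolution (multi-stage)."""
    reps, x = [], mel
    for codebooks in stages:
        reps.append(quantize(x, codebooks))
        x = x[::2]                              # halve the time resolution
    return reps

# Toy run: 8 frames of a 4-dim "Mel" feature, 2 stages x 2 codebooks.
mel = rng.normal(size=(8, 4))
stages = [[rng.normal(size=(6, 4)) for _ in range(2)] for _ in range(2)]
reps = msmc_encode(mel, stages)
print([r.shape for r in reps])  # [(8, 4), (4, 4)]
```

In the actual system the predictor is trained to produce these multi-resolution representations from text, and a neural vocoder maps them back to waveforms.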
QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve
TTS quality with lower supervised data requirements via Vector-Quantized
Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more
unlabeled speech audio. This framework comprises two VQ-S3R learners: first,
the principal learner aims to provide a generative Multi-Stage Multi-Codebook
(MSMC) VQ-S3R via the MSMC-VQ-GAN combined with the contrastive S3RL, while
decoding it back to high-quality audio; then, the associate learner further
abstracts the MSMC representation into a highly-compact VQ representation
through a VQ-VAE. These two generative VQ-S3R learners provide effective
speech representations and pre-trained models for TTS, significantly improving
synthesis quality while requiring less supervised data. QS-TTS is
evaluated comprehensively under various scenarios via subjective and objective
tests. The results demonstrate the superior
performance of QS-TTS, which achieves the highest MOS among supervised and
semi-supervised baseline TTS approaches, especially in low-resource scenarios.
Moreover, comparing various speech representations and transfer-learning
methods in TTS further validates the notable improvement the proposed
VQ-S3RL brings to TTS, yielding the best audio-quality and intelligibility
metrics. The slower decay in the synthesis quality of QS-TTS as supervised
data decreases further highlights its lower requirement for supervised data,
indicating its great potential in low-resource scenarios.
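The associate learner's compaction step is, at its core, a vector-quantization bottleneck: each continuous vector is replaced by its nearest codeword, so a long representation is summarized by a short sequence of discrete indices. A minimal sketch with a hand-picked toy codebook (assumed for illustration; the paper's VQ-VAE learns its codebook jointly with an encoder and decoder):

```python
import numpy as np

def vq_bottleneck(z: np.ndarray, codebook: np.ndarray):
    """Replace each vector in z by its nearest codebook entry,
    returning the discrete indices and the quantized vectors."""
    d = ((z[:, None, :] - codebook[None]) ** 2).sum(-1)  # (N, K) distances
    idx = d.argmin(1)                                    # nearest codeword
    return idx, codebook[idx]

# Toy codebook with two 2-dim codewords.
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0]])
z = np.array([[0.1, -0.1],    # close to codeword 0
              [0.9, 1.2]])    # close to codeword 1
idx, zq = vq_bottleneck(z, codebook)
print(idx.tolist())  # [0, 1]
```

The discrete indices are what make the representation highly compact: downstream TTS training only needs to predict codebook entries rather than raw continuous features.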
Tuning the Magnetic Ordering Temperature of Hexagonal Ferrites by Structural Distortion Control
To tune the magnetic properties of hexagonal ferrites, a family of
magnetoelectric multiferroic materials, by atomic-scale structural engineering,
we studied the effect of structural distortion on the magnetic ordering
temperature (TN). Using symmetry analysis, we show that, unlike most
antiferromagnetic rare-earth transition-metal perovskites, a larger structural
distortion leads to a higher TN in hexagonal ferrites and manganites, because
the K3 structural distortion induces three-dimensional magnetic ordering,
which is forbidden in the undistorted structure by symmetry. We also revealed a
near-linear relation between TN and the tolerance factor and a power-law
relation between TN and the K3 distortion amplitude. Following the analysis, a
record-high TN (185 K) among hexagonal ferrites was predicted in hexagonal
ScFeO3 and experimentally verified in epitaxially stabilized films. These
results add to the paradigm of spin-lattice coupling in antiferromagnetic
oxides and suggest further tunability of hexagonal ferrites if greater lattice
distortion can be achieved.
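The two empirical relations reported above can be written compactly; the coefficients $a$, $b$, and the exponent $\alpha$ are fit parameters not given in the abstract:

```latex
% near-linear relation between the ordering temperature and the tolerance factor t
T_N \approx a\,t + b
% power-law relation between T_N and the K3 distortion amplitude
T_N \propto Q_{K_3}^{\alpha}
```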
- …