Speaker-independent neural formant synthesis
We describe speaker-independent speech synthesis driven by a small set of
phonetically meaningful speech parameters such as formant frequencies. The
intention is to leverage deep-learning advances to provide a highly realistic
signal generator that includes control affordances required for stimulus
creation in the speech sciences. Our approach turns input speech parameters
into predicted mel-spectrograms, which are rendered into waveforms by a
pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that
the method achieves our goals of accurate control over speech parameters
combined with high perceptual audio quality. We also find that the small set of
phonetically relevant speech parameters we use is sufficient to allow for
speaker-independent synthesis (a.k.a. universal vocoding).
Comment: 5 pages, 4 figures. Article accepted at INTERSPEECH 202
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
Self-supervised learning (SSL) proficiency in speech-related tasks has driven
research into utilizing discrete tokens for speech tasks like recognition and
translation, which offer lower storage requirements and great potential to
employ natural language processing techniques. However, these studies were
mainly single-task focused and faced challenges such as overfitting and
performance degradation in speech recognition tasks, often sacrificing
performance in multi-task scenarios. This study presents a comprehensive
comparison and optimization of discrete tokens generated by various leading SSL
models in speech recognition and synthesis tasks. We aim to explore the
universality of speech discrete tokens across multiple speech tasks.
Experimental results demonstrate that discrete tokens achieve results
comparable to systems trained on FBank features in speech recognition tasks
and outperform mel-spectrogram features in speech synthesis on both subjective
and objective metrics. These findings suggest that universal discrete tokens have
enormous potential in various speech-related tasks. Our work is open-source and
publicly available to facilitate research in this direction.
On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis
Self-supervised learning (SSL) speech representations learned from large
amounts of diverse, mixed-quality speech data without transcriptions are
gaining ground in many speech technology applications. Prior work has shown
that SSL is an effective intermediate representation in two-stage
text-to-speech (TTS) for both read and spontaneous speech. However, it is still
not clear which SSL model, and which layer within each model, is best suited to
spontaneous TTS. We address this shortcoming by extending the scope of
comparison for SSL in spontaneous TTS to 6 different SSL models and 3 layers
within each. Furthermore, SSL has also shown potential in predicting the mean
opinion scores (MOS) of synthesized speech, but this has only been done in
read-speech MOS prediction. We extend an SSL-based MOS prediction framework
previously developed for scoring read speech synthesis and evaluate its
performance on synthesized spontaneous speech. All experiments are conducted
twice on two different spontaneous corpora in order to find generalizable
trends. Overall, we present comprehensive experimental results on the use of
SSL in spontaneous TTS and MOS prediction to further quantify and understand
how SSL can be used in spontaneous TTS. Audio samples:
https://www.speech.kth.se/tts-demos/sp_ssl_tts
Comment: 7 pages, 2 figures. 12th ISCA Speech Synthesis Workshop (SSW) 202