Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders are superior in inference
speed and synthesis quality when reconstructing an audible waveform from an
acoustic representation. This study focuses on improving the discriminator to
advance GAN-based vocoders. Most existing time-frequency-representation-based
discriminators are rooted in Short-Time Fourier Transform (STFT), whose
time-frequency resolution in a spectrogram is fixed, making it incompatible
with signals like singing voices that require flexible attention for different
frequency bands. Motivated by this, our study utilizes the Constant-Q
Transform (CQT), which offers dynamic resolution across frequencies and thus
better modeling of pitch accuracy and harmonic tracking. Specifically,
we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates
on the CQT spectrogram at multiple scales and performs sub-band processing
according to different octaves. Experiments conducted on both speech and
singing voices confirm the effectiveness of our proposed method. Moreover, we
verified that the CQT-based and STFT-based discriminators are complementary
under joint training: enhanced by the proposed MS-SB-CQT and the existing
MS-STFT Discriminators, the MOS of HiFi-GAN is boosted from 3.27 to 3.87 for
seen singers and from 3.40 to 3.78 for unseen singers.
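The dynamic resolution the abstract attributes to the CQT comes from its
geometrically spaced bins, which keep the quality factor Q constant across
frequency. A stdlib-only sketch of that bin spacing and of the per-octave
sub-band grouping (the 32.70 Hz f_min and the 84-bin, 12-per-octave layout
are common defaults, not necessarily the paper's settings):

```python
import math

def cqt_center_frequencies(f_min=32.70, bins_per_octave=12, n_bins=84):
    """Geometrically spaced CQT center frequencies: f_k = f_min * 2**(k / B)."""
    return [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]

def octave_sub_bands(freqs, bins_per_octave=12):
    """Group CQT bins into per-octave sub-bands, mirroring the sub-band
    processing 'according to different octaves' in the discriminator."""
    return [freqs[i:i + bins_per_octave]
            for i in range(0, len(freqs), bins_per_octave)]

freqs = cqt_center_frequencies()
bands = octave_sub_bands(freqs)

# The Q factor (center frequency / bin bandwidth) is the same for every bin,
# so low octaves get narrow bins (fine pitch resolution) and high octaves get
# wide bins -- unlike the STFT, whose bin spacing is fixed.
q = freqs[0] / (freqs[1] - freqs[0])
```

Because Q is constant, each octave's sub-band has a different effective
resolution, which is the property the sub-band processing exploits.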
An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders are superior in both
inference speed and synthesis quality when reconstructing an audible waveform
from an acoustic representation. This study focuses on improving the
discriminator for GAN-based vocoders. Most existing Time-Frequency
Representation (TFR)-based discriminators are rooted in Short-Time Fourier
Transform (STFT), which has a constant Time-Frequency (TF) resolution,
linearly scaled center frequencies, and a fixed decomposition basis, making it
incompatible with signals like singing voices that require dynamic attention
for different frequency bands and different time intervals. Motivated by that,
we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT)
discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet
Transform (MS-TC-CWT) discriminator. Both the CQT and the CWT have dynamic TF
resolution across frequency bands; the CQT better models pitch information,
while the CWT better models short-time transients. Experiments conducted on
both speech and singing voices
confirm the effectiveness of our proposed discriminators. Moreover, the STFT,
CQT, and CWT-based discriminators can be used jointly for better performance.
The proposed discriminators can boost the synthesis quality of various
state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
Comment: arXiv admin note: text overlap with arXiv:2311.1495
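The abstract does not spell out how the STFT-, CQT-, and CWT-based
discriminators combine under joint training; a common recipe is simply to sum
per-discriminator adversarial losses. A hypothetical sketch using a hinge
loss (the loss form and the score values are illustrative assumptions, not
the paper's objective):

```python
def hinge_d_loss(real_score, fake_score):
    """Discriminator hinge loss for one (real, fake) score pair."""
    return max(0.0, 1.0 - real_score) + max(0.0, 1.0 + fake_score)

def joint_d_loss(scores):
    """Sum the hinge loss over an ensemble of TFR discriminators,
    e.g. STFT-, CQT-, and CWT-based ones trained jointly."""
    return sum(hinge_d_loss(r, f) for r, f in scores)

# scores[i] = (real_score, fake_score) from the i-th discriminator
scores = [(0.9, -0.8), (1.2, -1.1), (0.5, 0.2)]
loss = joint_d_loss(scores)
```

Each discriminator sees the same waveform through a different
time-frequency lens, so their gradients complement rather than duplicate
one another.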
SponTTS: modeling and transferring spontaneous style for TTS
Spontaneous speaking style exhibits notable differences from other speaking
styles due to various spontaneous phenomena (e.g., filled pauses, prolongation)
and substantial prosody variation (e.g., diverse pitch and duration variation,
occasional non-verbal speech like a smile), posing challenges to modeling and
prediction of spontaneous style. Moreover, the scarcity of high-quality
spontaneous data constrains spontaneous speech generation for speakers who
have no such data. To address these problems, we propose SponTTS, a two-stage
approach based on neural bottleneck (BN) features to model and transfer
spontaneous style for TTS. In the first stage, we adopt a Conditional
Variational Autoencoder (CVAE) to capture spontaneous prosody from BN
features, incorporating spontaneous phenomena through a spontaneous-phenomena
embedding prediction loss. In addition, we introduce a flow-based
predictor to predict a latent spontaneous style representation from the text,
which enriches the prosody and context-specific spontaneous phenomena during
inference. In the second stage, we adopt a VITS-like module to transfer the
spontaneous style learned in the first stage to the target speakers.
Experiments demonstrate that SponTTS is effective in modeling spontaneous style
and transferring the style to the target speakers, generating spontaneous
speech with high naturalness, expressiveness, and speaker similarity. The
zero-shot spontaneous style TTS test further verifies the generalization and
robustness of SponTTS in generating spontaneous speech for unseen speakers.
Comment: 5 pages, 3 figures, Accepted by ICASSP202
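The first-stage CVAE is trained with the usual evidence lower bound, whose
regularizer is a KL term between the learned posterior and a standard-normal
prior. A stdlib sketch of the closed-form diagonal-Gaussian KL (shapes and
inputs are illustrative; the abstract does not give the paper's exact
objective):

```python
import math

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), the CVAE regularizer:
    0.5 * sum(exp(logvar) + mu**2 - 1 - logvar) over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))
```

When the posterior matches the prior exactly (zero mean, unit variance),
the KL term vanishes, which is what keeps the latent spontaneous-style
space well behaved for the flow-based predictor to sample from.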
SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion
In this study, we present SingVisio, an interactive visual analysis system
that aims to explain the diffusion model used in singing voice conversion.
SingVisio provides a visual display of the generation process in diffusion
models, showcasing the step-by-step denoising of the noisy spectrum and its
transformation into a clean spectrum that captures the desired singer's timbre.
The system also facilitates side-by-side comparisons of different conditions,
such as source content, melody, and target timbre, highlighting the impact of
these conditions on the diffusion generation process and resulting conversions.
Through comprehensive evaluations, SingVisio demonstrates its effectiveness in
terms of system design, functionality, explainability, and user-friendliness.
It offers users of various backgrounds valuable learning experiences and
insights into the diffusion model for singing voice conversion.
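Displaying the step-by-step denoising that SingVisio visualizes requires
keeping every intermediate state of the reverse diffusion process rather
than only the final output. A toy sketch of that bookkeeping (the
`denoise_step` below is a hypothetical stand-in, not a real diffusion
sampler):

```python
def denoise_step(x, t, target=1.0, rate=0.5):
    """Toy stand-in for one reverse-diffusion step: nudge each value of the
    noisy state toward a clean target."""
    return [xi + rate * (target - xi) for xi in x]

def reverse_diffusion_with_snapshots(x_T, n_steps):
    """Run the reverse process from the noisy state x_T and record every
    intermediate state -- the raw material for a step-by-step visual
    display of the denoising trajectory."""
    snapshots = [list(x_T)]
    x = list(x_T)
    for t in range(n_steps, 0, -1):
        x = denoise_step(x, t)
        snapshots.append(list(x))
    return snapshots

snaps = reverse_diffusion_with_snapshots([0.0, 2.0], n_steps=4)
```

Keeping the full trajectory is also what enables side-by-side comparison of
different conditions: two runs can be aligned snapshot-by-snapshot.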
Text-aware and Context-aware Expressive Audiobook Speech Synthesis
Recent advances in text-to-speech have significantly improved the
expressiveness of synthetic speech. However, a major challenge remains in
generating speech that captures the diverse styles exhibited by professional
narrators in audiobooks without relying on manually labeled data or reference
speech. To address this problem, we propose a text-aware and
context-aware (TACA) style modeling approach for expressive audiobook speech
synthesis. We first establish a text-aware style space to cover diverse styles
via contrastive learning with the supervision of the speech style. Meanwhile,
we adopt a context encoder to incorporate cross-sentence information and the
style embedding obtained from text. Finally, we integrate the context encoder
into two typical TTS models: VITS-based TTS and language model-based TTS.
Experimental results demonstrate that our proposed approach can effectively
capture diverse styles and coherent prosody, and consequently improves
naturalness and expressiveness in audiobook speech synthesis.
Comment: Accepted by INTERSPEECH202
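The contrastive learning that builds the text-aware style space can be
sketched as an InfoNCE-style objective: pull a text embedding toward its
paired speech-style embedding and push it away from the other styles in the
batch. The loss form, temperature, and toy vectors below are illustrative
assumptions, not the paper's exact objective:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(text_emb, style_embs, pos_index, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive pair's similarity
    among all candidate speech-style embeddings."""
    logits = [cosine(text_emb, s) / temperature for s in style_embs]
    log_norm = math.log(sum(math.exp(l) for l in logits))
    return log_norm - logits[pos_index]

# Loss is small when the text embedding matches its paired style embedding.
loss_pos = info_nce([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], pos_index=0)
```

Supervising the text side with the speech style this way is what lets the
model infer a plausible style from text alone at inference time.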
Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
Voice conversion for highly expressive speech is challenging. Current
approaches struggle to balance speaker similarity, intelligibility, and
expressiveness. To address this problem, we propose
Expressive-VC, a novel end-to-end voice conversion framework that leverages
advantages from both neural bottleneck feature (BNF) approach and information
perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav
encoder to form a content extractor to learn linguistic and para-linguistic
features respectively, where BNFs come from a robust pre-trained ASR model and
the perturbed wave becomes speaker-irrelevant after signal perturbation. We
further fuse the linguistic and para-linguistic features through an attention
mechanism, where speaker-dependent prosody features serve as the attention
query; these features are produced by a prosody encoder that takes the target
speaker embedding and the normalized pitch and energy of the source speech as
input. Finally,
the decoder consumes the integrated features and the speaker-dependent prosody
feature to generate the converted speech. Experiments demonstrate that
Expressive-VC is superior to several state-of-the-art systems, achieving both
high expressiveness captured from the source speech and high speaker similarity
with the target speaker, while intelligibility is well maintained.
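The attention fusion described above can be sketched as standard scaled
dot-product attention in which the prosody feature is the query and the
linguistic/para-linguistic features serve as keys and values. Dimensions and
values are toy examples; the abstract does not give the paper's exact
formulation:

```python
import math

def attention_fuse(query, keys, values):
    """Scaled dot-product attention: the speaker-dependent prosody feature
    (query) weights the content features (keys = values) before fusion."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    fused = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return fused, weights

fused, w = attention_fuse([1.0, 0.0],
                          keys=[[1.0, 0.0], [0.0, 1.0]],
                          values=[[1.0, 0.0], [0.0, 1.0]])
```

Using prosody as the query lets the target speaker's delivery decide, per
step, how much linguistic versus para-linguistic content to draw on.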
Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion
Zero-shot voice conversion (VC) converts source speech into the voice of any
desired speaker using only one utterance of the speaker without requiring
additional model updates. Typical methods use a speaker representation from a
pre-trained speaker verification (SV) model or learn speaker representation
during VC training to achieve zero-shot VC. However, existing speaker
modeling methods overlook how the richness of speaker information varies
across the temporal and frequency-channel dimensions of speech. This
insufficient speaker modeling
hampers the ability of the VC model to accurately represent unseen speakers who
are not in the training dataset. In this study, we present a robust zero-shot
VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC.
Specifically, to flexibly adapt to speaker characteristics that vary along
the temporal and channel axes of speech, we propose a novel fine-grained
speaker modeling method, called temporal-channel retrieval (TCR), to find out
when and where speaker information appears in speech. It retrieves
variable-length speaker representation from both temporal and channel
dimensions under the guidance of a pre-trained SV model. Besides, inspired by
the hierarchical process of human speech production, the MTCR speaker module
stacks several TCR blocks to extract speaker representations from
multi-granularity levels. Furthermore, to achieve better speech disentanglement
and reconstruction, we introduce a cycle-based training strategy to simulate
zero-shot inference recurrently. We adopt perceptual constraints on three
aspects, including content, style, and speaker, to drive this process.
Experiments demonstrate that MTCR-VC is superior to the previous zero-shot VC
methods in modeling speaker timbre while maintaining good speech naturalness.
Comment: Submitted to TASL
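A simplified, temporal-only view of the retrieval idea: weight each frame by
its similarity to a pre-trained SV embedding and pool, so speaker-rich frames
dominate the pooled representation. This is an illustrative sketch only; the
actual TCR also retrieves along the channel dimension and stacks several
blocks for multi-granularity levels:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def temporal_retrieval(frames, sv_embedding):
    """Score each frame against the SV-model embedding, turn the scores into
    attention weights, and pool -- frames carrying more speaker information
    contribute more to the speaker representation."""
    sims = [sum(f * s for f, s in zip(frame, sv_embedding))
            for frame in frames]
    w = softmax(sims)
    dim = len(frames[0])
    return [sum(wi * frame[i] for wi, frame in zip(w, frames))
            for i in range(dim)]

# The first frame aligns with the SV embedding, so it dominates the pool.
pooled = temporal_retrieval([[1.0, 0.0], [0.0, 1.0]],
                            sv_embedding=[1.0, 0.0])
```

The guidance from the frozen SV model is what lets the retrieval find
"when and where" speaker information appears without frame-level labels.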
Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
The multi-codebook speech codec enables the application of large language
models (LLMs) in TTS but bottlenecks efficiency and robustness due to
multi-sequence prediction. To overcome this obstacle, we propose Single-Codec, a
single-codebook single-sequence codec, which employs a disentangled VQ-VAE to
decouple speech into a time-invariant embedding and a phonetically-rich
discrete sequence. Furthermore, the encoder is enhanced with 1) contextual
modeling with a BLSTM module to exploit the temporal information, 2) a hybrid
sampling module to alleviate distortion from upsampling and downsampling, and
3) a resampling module to encourage discrete units to carry more phonetic
information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec,
Single-Codec demonstrates higher reconstruction quality with a lower bandwidth
of only 304bps. The effectiveness of Single-Code is further validated by
LLM-TTS experiments, showing improved naturalness and intelligibility.Comment: Accepted by Interspeech 202
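The "phonetically-rich discrete sequence" in a single-codebook codec comes
from vector quantization: each encoder frame is mapped to the index of its
nearest codebook entry, producing exactly one token per frame -- the single
sequence an LLM can then predict. A minimal sketch with a toy codebook
(vectors and dimensions are illustrative):

```python
def quantize(frame, codebook):
    """Single-codebook VQ: return the index of the nearest code vector by
    squared-L2 distance, i.e. one discrete token for this frame."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
tokens = [quantize(f, codebook)
          for f in [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]]
```

A multi-codebook codec would emit several such token streams per frame,
which is the multi-sequence prediction burden Single-Codec avoids.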
