ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation
There are significant challenges for speaker adaptation in text-to-speech for
languages that are not widely spoken or for speakers with accents or dialects
that are not well-represented in the training data. To address this issue, we
propose the use of the "mixture of adapters" method. This approach involves
adding multiple adapters within a backbone-model layer to learn the unique
characteristics of different speakers. Our approach outperforms the baseline,
with a noticeable improvement of 5% observed in speaker preference tests when
using only one minute of data for each new speaker. Moreover, following the
adapter paradigm, we fine-tune only the adapter parameters (11% of the total
model parameters). This is a significant achievement in parameter-efficient
speaker adaptation, and one of the first models of its kind. Overall, our
proposed approach offers a promising solution for speech synthesis,
particularly for adapting to speakers from diverse backgrounds.
Comment: Interspeech 2023
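As a rough illustration of the mixture-of-adapters idea described above, the sketch below places several bottleneck adapters in parallel after a frozen backbone layer and mixes their outputs with a learned router. The class and parameter names (AdapterMixLayer, num_adapters, bottleneck_dim) are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of a "mixture of adapters" layer, assuming a PyTorch backbone.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project, non-linearity, up-project, residual connection."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdapterMixLayer(nn.Module):
    """Several adapters in parallel; a softmax router mixes their outputs."""

    def __init__(self, hidden_dim: int, num_adapters: int = 4, bottleneck_dim: int = 32):
        super().__init__()
        self.adapters = nn.ModuleList(
            BottleneckAdapter(hidden_dim, bottleneck_dim) for _ in range(num_adapters)
        )
        self.router = nn.Linear(hidden_dim, num_adapters)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, hidden_dim) output of a frozen backbone layer
        weights = torch.softmax(self.router(hidden), dim=-1)               # (B, T, K)
        stacked = torch.stack([a(hidden) for a in self.adapters], dim=-1)  # (B, T, H, K)
        return (stacked * weights.unsqueeze(2)).sum(dim=-1)


# Parameter-efficient adaptation would freeze the backbone and train only the
# adapters and router, e.g.:
# for p in backbone.parameters():
#     p.requires_grad_(False)
```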
VoiceLens: Controllable Speaker Generation and Editing with Flow
Currently, many multi-speaker speech synthesis and voice conversion systems
address speaker variations with an embedding vector. Modeling it directly
allows new voices outside of training data to be synthesized. GMM based
approaches such as Tacospawn are favored in literature for this generation
task, but there are still some limitations when difficult conditionings are
involved. In this paper, we propose VoiceLens, a semi-supervised flow-based
approach, to model speaker embedding distributions for multi-conditional
speaker generation. VoiceLens maps speaker embeddings into a combination of
independent attributes and residual information. It allows new voices
associated with certain attributes to be generated for existing TTS
models, and attributes of known voices to be meaningfully edited. We
show in this paper that VoiceLens has an unconditional generation capacity
similar to Tacospawn's while offering higher controllability and
flexibility when used in a conditional manner. In addition, we show that
less noisy speech can be synthesized from known noisy speakers, without
re-training the TTS model, solely by editing their embeddings with an
SNR-conditioned VoiceLens model. Demos are available at
sos1sos2sixteen.github.io/voicelens
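To make the generate/edit workflow concrete, the toy sketch below assumes an invertible map from a speaker embedding to a concatenation of attributes and residual information. A single orthogonal linear layer stands in for the actual normalizing flow, and all names (ToySpeakerFlow, attr_dim) are hypothetical placeholders rather than the paper's model.

```python
# Toy sketch of attribute-conditioned speaker generation and editing.
import torch
import torch.nn as nn


class ToySpeakerFlow(nn.Module):
    def __init__(self, embed_dim: int = 8, attr_dim: int = 2):
        super().__init__()
        self.attr_dim = attr_dim
        # Orthogonal matrix: a trivially invertible stand-in for a flow.
        q, _ = torch.linalg.qr(torch.randn(embed_dim, embed_dim))
        self.weight = nn.Parameter(q)

    def forward(self, emb: torch.Tensor):
        z = emb @ self.weight
        return z[..., : self.attr_dim], z[..., self.attr_dim :]  # attributes, residual

    def inverse(self, attrs: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        z = torch.cat([attrs, residual], dim=-1)
        return z @ self.weight.T  # orthogonal => transpose is the inverse


flow = ToySpeakerFlow()

with torch.no_grad():
    # "Generate": pick target attributes, sample a residual, invert to an embedding.
    new_emb = flow.inverse(torch.tensor([[1.0, -0.5]]), torch.randn(1, 6))

    # "Edit": map a known embedding forward, overwrite an attribute, map back.
    attrs, residual = flow(torch.randn(1, 8))
    attrs[:, 0] = 2.0  # e.g., push an SNR-like attribute toward "clean"
    edited_emb = flow.inverse(attrs, residual)
```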
Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations
While most research into speech synthesis has focused on synthesizing
high-quality speech for in-dataset speakers, an equally essential yet unsolved
problem is synthesizing speech for out-of-dataset (unseen) speakers with
limited reference data, i.e., speaker-adaptive speech synthesis. Many studies
have proposed zero-shot speaker adaptive text-to-speech and voice conversion
approaches aimed at this task. However, most current approaches suffer from
degraded naturalness and speaker similarity when synthesizing speech for
unseen speakers (i.e., speakers not in the training dataset) due to the poor
generalizability of the model on out-of-distribution data. To address this
problem, we propose GZS-TV, a generalizable zero-shot speaker adaptive
text-to-speech and voice conversion model. GZS-TV introduces disentangled
representation learning for both speaker embedding extraction and timbre
transformation to improve model generalization and leverages the representation
learning capability of the variational autoencoder to enhance the speaker
encoder. Our experiments demonstrate that GZS-TV reduces performance
degradation on unseen speakers and outperforms all baseline models on
multiple datasets.
Comment: 5 pages, 3 figures. Accepted by Interspeech 2023, Oral
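The abstract attributes the improved generalization partly to a VAE-style speaker encoder. The sketch below shows what such an encoder might look like, assuming mel-spectrogram input averaged over time; layer sizes and names (VariationalSpeakerEncoder, n_mels, speaker_dim) are assumptions for illustration, not the GZS-TV architecture.

```python
# Minimal sketch of a VAE-style speaker encoder with a KL regularizer.
import torch
import torch.nn as nn


class VariationalSpeakerEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, speaker_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.to_mean = nn.Linear(512, speaker_dim)
        self.to_logvar = nn.Linear(512, speaker_dim)

    def forward(self, mels: torch.Tensor):
        # mels: (batch, time, n_mels) -> time-averaged utterance summary
        h = self.net(mels.mean(dim=1))
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        speaker_emb = mean + std * torch.randn_like(std)  # reparameterization trick
        # KL term that regularizes the embedding space toward a standard normal,
        # the kind of constraint meant to help generalization to unseen speakers.
        kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
        return speaker_emb, kl
```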