There are significant challenges for speaker adaptation in text-to-speech for
languages that are not widely spoken or for speakers with accents or dialects
that are not well-represented in the training data. To address this issue, we
propose the use of the "mixture of adapters" method. This approach involves
adding multiple adapters within a backbone-model layer to learn the unique
characteristics of different speakers. Our approach outperforms the baseline,
with a noticeable improvement of 5% observed in speaker preference tests when
using only one minute of data for each new speaker. Moreover, following the
adapter paradigm, we fine-tune only the adapter parameters (11% of the total
model parameters). This is a significant achievement in parameter-efficient
speaker adaptation, and one of the first models of its kind. Overall, our
proposed approach offers a promising solution to the speech synthesis
techniques, particularly for adapting to speakers from diverse backgrounds.Comment: Interspeech 202