With recent developments in cross-lingual Text-to-Speech (TTS) systems,
L2 (second-language, or foreign) accent problems have emerged. Moreover, running
subjective evaluations of such cross-lingual TTS systems is troublesome.
Vowel space analysis, which is often used to explore various aspects of
language including L2 accents, is a useful alternative analysis tool. In this
study, we apply the vowel space analysis method to explore L2 accents of
cross-lingual TTS systems. Through this analysis, we make the following
three observations: a) a parallel architecture (Glow-TTS) is less L2-accented
than an auto-regressive one (Tacotron); b) L2 accents are more dominant in
non-shared vowels in a language pair; and c) L2 accents of cross-lingual TTS
systems share some phenomena with those of human L2 learners. Our findings
imply that TTS systems need to handle each language pair
differently, depending on its linguistic characteristics such as non-shared
vowels. They also hint that we can further incorporate linguistic knowledge in
developing cross-lingual TTS systems.

Comment: Submitted to ICASSP 202