Investigating Speaker Embedding Disentanglement on Natural Read Speech
Disentanglement is the task of learning representations that identify and
separate factors that explain the variation observed in data. Disentangled
representations are useful to increase the generalizability, explainability,
and fairness of data-driven models. Little is known about how well such
disentanglement works for speech representations. A major challenge when
tackling disentanglement for speech representations is that the generative
factors underlying the speech signal are unknown. In this work, we investigate to what
degree speech representations encoding speaker identity can be disentangled. To
quantify disentanglement, we identify acoustic features that are highly
speaker-variant and can serve as proxies for the factors of variation
underlying speech. We find that disentanglement of the speaker embedding is
limited when trained with standard objectives promoting disentanglement but can
be improved over vanilla representation learning to some extent.Comment: To be published at 15th ITG conference on speech communicatio
On Feature Importance and Interpretability of Speaker Representations
Unsupervised speech disentanglement aims at separating fast varying from
slowly varying components of a speech signal. In this contribution, we take a
closer look at the embedding vector representing the slowly varying signal
components, commonly named the speaker embedding vector. We ask which
properties of a speaker's voice are captured and investigate to what extent
individual embedding vector components are responsible for them, using the
concept of Shapley values. Our findings show that certain speaker-specific
acoustic-phonetic properties can be fairly well predicted from the speaker
embedding, while the more abstract voice quality features investigated here cannot.
Comment: Presented at the ITG conference on Speech Communication 202
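The Shapley-value analysis mentioned above can be approximated with off-the-shelf tooling; the following sketch assumes the third-party shap package as a stand-in for the authors' computation, with placeholder embeddings and a placeholder target property.

# Attribute a predicted acoustic-phonetic property of a voice to individual
# speaker-embedding components via (approximate) Shapley values.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 32))  # 32-dimensional speaker embeddings (placeholder)
y = 0.8 * X[:, 3] - 0.5 * X[:, 17] + 0.1 * rng.normal(size=300)  # e.g. mean F0 (placeholder)

model = GradientBoostingRegressor().fit(X, y)

# KernelExplainer approximates Shapley values for an arbitrary prediction function.
background = shap.sample(X, 50)
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:20])  # shape (20, 32): per-component contributions

importance = np.abs(shap_values).mean(axis=0)
print("Most influential embedding components:", np.argsort(importance)[::-1][:5])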
Re-examining the quality dimensions of synthetic speech
Seebauer FM, Kuhlmann M, Haeb-Umbach R, Wagner P. Re-examining the quality dimensions of synthetic speech. In: 12th ISCA Speech Synthesis Workshop (SSW2023). ISCA; 2023: 34-40.
Dimensions of quality for state of the art synthetic speech
Seebauer FM, Wagner P. Dimensions of quality for state of the art synthetic speech. In: Bruggeman A, Ludusan B, Universität Bielefeld, eds. Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum. Bielefeld; 2022.
Synthetic speech has a long-standing tradition of being employed for experiments in phonetics
and laboratory phonology. The choice of synthesis method and system is commonly made by the
researcher(s) to fit the specific quality criteria and study design. The overall quality of a given
system, however, remains a confound that is difficult to control for [1]. In speech technology,
newly proposed systems are usually compared across specific dimensions, e.g., “Intelligibility” and
“Naturalness”. These dimensions have already been extensively studied and evaluated within the
context of old diphone and formant synthesis systems [2]. We contend, however, that these
traditional dimensions need to be re-examined in the context of state-of-the-art Text-to-Speech (TTS)
systems, as those newer models exhibit different quality deteriorations. Our work aims to bridge
the conflicting demands for quality criteria that are easily computed and applied during TTS
development, while at the same time remaining descriptive and meaningful for phonetic research.
As a first step in this endeavor, we carried out an experiment to find suitable dimensions of TTS
quality with a bottom-up approach based on descriptions provided by 11 participants (phonetic
experts). The participants were instructed to label speech samples generated by 8 different
state-of-the-art Text-to-Speech systems (varieties of English). Each system produced a stimulus
consisting of two sentences of the phonetically balanced “caterpillar story” [3]. In order to ensure that all
systems were evaluated across different phonetic contexts in a balanced way, the sentences were
rotated between participants so that each participant heard the complete story but with different
parts read by different systems. The experimental setup is loosely based on the work in [4]. The
participants were instructed to write down nouns, adjectives or sentences describing the quality
of a given stimulus. Using embeddings generated by a pretrained BERT model [5] for semantic
distances, we determined which of the participants' terms were semantically similar. A subsequent
affinity propagation clustering revealed there to be 39 meaningfully different clusters, each
representing a dimension of quality for synthetic voices. Keeping in mind that these dimensions are
later to be used for ratings in actual evaluation experiments, it was decided to reduce the number
of clusters to a more practical number of 10 and to re-calculate the clustering with spectral
clustering on a precomputed cosine affinity matrix (a code sketch of this pipeline follows this
abstract). The resulting clusters and their respective quality descriptions are depicted in fig. 1.
A manual analysis of the resulting dimensions led to the following descriptive labels:
“artificiality/voice quality”, “intonation/noise/prosody”, “voice/audio quality”, “audio cuts”,
“style/recording quality”, “emotion/voice quality/attitude”, “engagedness”, “human likeness”,
“hyperarticulation”. From the assigned cluster descriptions it is evident that the semantic
embeddings sometimes conflated several seemingly unrelated quality features into single
dimensions (e.g. prosody and background noise), while occasionally splitting almost synonymous
terms into multiple clusters (e.g. “artificiality”, “roboticness” and “metallicness”). To evaluate these
shortcomings of the semantic model, two independent manual clusterings were carried out. They
were both limited to 10 clusters and showed a modified Jaccard agreement index of 63.44 with each
other, while agreeing with the automatically computed clusters at 54.48 and 57.93, respectively.
The low inter-rater agreement between the manual clusterings suggests that a panel decision
process might be needed to determine the final quality dimensions. Subsequent research will
evaluate clusters created by naïve listeners and quality dimensions of different sub-tasks in
synthetic speech, such as voice conversion.
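A rough sketch of the clustering pipeline described in this abstract, under the following assumptions: a sentence-transformers model stands in for the pretrained BERT model [5], the term list is invented, and only 3 final clusters are requested here (the experiment used 10).

# Semantic embeddings -> affinity propagation -> spectral clustering on a
# precomputed cosine affinity matrix.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AffinityPropagation, SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

terms = ["robotic", "metallic", "natural", "monotone", "noisy", "choppy",
         "expressive", "muffled", "artificial", "engaging"]  # placeholder descriptions

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(terms)  # semantic embeddings of the participants' terms

# Step 1: affinity propagation determines the number of clusters itself
# (39 in the experiment described above).
ap_labels = AffinityPropagation(random_state=0).fit_predict(emb)

# Step 2: reduce to a practical number of clusters with spectral clustering
# on a precomputed cosine affinity matrix (clipped to be non-negative).
affinity = np.clip(cosine_similarity(emb), 0, None)
sc_labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                               random_state=0).fit_predict(affinity)

for term, label in zip(terms, sc_labels):
    print(label, term)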
Investigation into Target Speaking Rate Adaptation for Voice Conversion
Kuhlmann M, Seebauer FM, Ebbers J, Wagner P, Haeb-Umbach R. Investigation into Target Speaking Rate Adaptation for Voice Conversion. In: Proceedings of Interspeech. 2022: 4930-4934.
Disentangling speaker and content attributes of a speech signal into separate latent representations, followed by decoding the content with an exchanged speaker representation, is a popular approach for voice conversion, which can be trained with non-parallel and unlabeled speech data. However, previous approaches perform disentanglement only implicitly via some sort of information bottleneck or normalization, where it is usually hard to find a good trade-off between voice conversion and content reconstruction. Further, previous works usually do not consider an adaptation of the speaking rate to the target speaker, or they place major restrictions on the data or use case. Therefore, the contribution of this work is two-fold. First, we employ an explicit and fully unsupervised disentanglement approach, which has previously only been used for representation learning, and show that it allows us to obtain both superior voice conversion and content reconstruction. Second, we investigate simple and generic approaches to linearly adapt the length of a speech signal, and hence the speaking rate, to a target speaker, and show that the proposed adaptation increases the speaking rate similarity with respect to the target speaker.
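One generic way to linearly adapt the length of a speech signal, and hence its speaking rate, is to resample a frame-level representation along the time axis; the sketch below is an illustration of that idea with placeholder data and speaking rates, not the system proposed in the paper.

# Linearly resample a (T, D) frame sequence so its duration matches a target speaking rate.
import numpy as np

def linear_length_adaptation(frames: np.ndarray, source_rate: float, target_rate: float) -> np.ndarray:
    T, D = frames.shape
    # Speaking faster (higher rate) means fewer frames, i.e. a shorter signal.
    new_T = max(1, int(round(T * source_rate / target_rate)))
    new_idx = np.linspace(0, T - 1, new_T)
    old_idx = np.arange(T)
    # Linear interpolation independently per feature dimension.
    return np.stack([np.interp(new_idx, old_idx, frames[:, d]) for d in range(D)], axis=1)

content = np.random.randn(200, 80)  # e.g. 200 frames of a content representation (placeholder)
adapted = linear_length_adaptation(content, source_rate=3.8, target_rate=5.0)  # rates in syllables/s (assumed)
print(content.shape, "->", adapted.shape)  # (200, 80) -> (152, 80)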
Discerning dimensions of quality for state of the art synthetic speech
Seebauer FM, Kuhlmann M, Haeb-Umbach R, Wagner P. Discerning dimensions of quality for state of the art synthetic speech. In: Skarnitzl R, Volín J, eds. Proceedings of the 20th International Congress of Phonetic Sciences. Prague; 2023: 3106-3110.
This paper describes an approach for determining
the dimensions of quality for state-of-the-art
synthetic speech. We propose that current evaluation
metrics do not fully capture the meaningful
dimensions of text-to-speech (TTS) and voice
conversion (VC) systems. In order to develop a
revised paradigm for meaningful evaluation, we
conducted two experiments. First, we determined
descriptive terms by querying naïve listeners on
their impressions of modern TTS and VC systems.
In a second experiment, we refined these terms
into dimensions of quality and similarity by
showcasing a consolidation procedure of manual
clusterings. The resulting dimensions contain the
standard evaluation categories of “intelligibility”
and “naturalness” for both conditions. We could
additionally discern dimensions of “tempo” and
“demographics” in both domains. The final two
dimensions as well as the relationships between
categories proved to be different between TTS and
VC, suggesting the need for modified evaluation
scales based on the target construct.
Speech Disentanglement for Analysis and Modification of Acoustic and Perceptual Speaker Characteristics
Rautenberg F, Kuhlmann M, Ebbers J, et al. Speech Disentanglement for Analysis and Modification of Acoustic and Perceptual Speaker Characteristics. In: Deutsche Gesellschaft für Akustik e.V. (DEGA), ed. Fortschritte der Akustik - DAGA 2023. Tagungsband. Berlin; 2023: 1409-1412.
Popular speech disentanglement systems decompose a
speech signal into a content and a speaker embedding,
where a decoder reconstructs the input signal from these
embeddings. Often, it is unknown which information is
encoded in the speaker embeddings. In this work, such a
system is investigated on German speech data. We show
that directions in the speaker embedding space correlate
with different acoustic signal properties that are known
to be characteristic of a speaker, and that, by manipulating these
embeddings along such a direction, the decoder synthesises a
speech signal with modified acoustic properties.
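The manipulation described in this abstract can be pictured as shifting a speaker embedding along a direction estimated from data; the sketch below uses a linear regression to obtain such a direction and is purely illustrative, with placeholder embeddings, an ad-hoc step size and a hypothetical decoder call.

# Estimate a direction in the speaker-embedding space that correlates with an
# acoustic property (e.g. mean F0) and shift an embedding along it.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
spk_emb = rng.normal(size=(1000, 64))  # speaker embeddings (placeholder)
mean_f0 = spk_emb @ rng.normal(size=64) + rng.normal(size=1000)  # placeholder acoustic property

reg = LinearRegression().fit(spk_emb, mean_f0)      # weights point in the direction of change
direction = reg.coef_ / np.linalg.norm(reg.coef_)

emb = spk_emb[0]
emb_higher_f0 = emb + 2.0 * direction               # step size chosen ad hoc
# audio = decoder(content_embedding, emb_higher_f0) # hypothetical decoder of the
#                                                   # disentanglement system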