17 research outputs found

    Dimensions of quality for state of the art synthetic speech

    Get PDF

    Investigating Speaker Embedding Disentanglement on Natural Read Speech

    Full text link
    Disentanglement is the task of learning representations that identify and separate the factors explaining the variation observed in data. Disentangled representations are useful for increasing the generalizability, explainability, and fairness of data-driven models. Little is known about how well such disentanglement works for speech representations. A major challenge when tackling disentanglement for speech representations is that the generative factors underlying the speech signal are unknown. In this work, we investigate to what degree speech representations encoding speaker identity can be disentangled. To quantify disentanglement, we identify acoustic features that are highly speaker-variant and can serve as proxies for the factors of variation underlying speech. We find that disentanglement of the speaker embedding is limited when trained with standard objectives promoting disentanglement, but can be improved over vanilla representation learning to some extent.
    Comment: To be published at the 15th ITG Conference on Speech Communication
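
    As a concrete illustration of the quantification idea, here is a minimal Python sketch (not the paper's exact metric; the data, the feature list, and the linear-probe choice are assumptions) measuring how strongly speaker-variant acoustic proxies can be read out of a speaker embedding:

        # Probe how well each speaker-variant acoustic proxy can be linearly
        # predicted from the speaker embedding; a high probe score means the
        # factor is strongly entangled with the embedding.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n_utts, emb_dim = 500, 64
        embeddings = rng.normal(size=(n_utts, emb_dim))  # placeholder per-utterance speaker embeddings
        proxies = {                                      # placeholder acoustic features
            "mean_f0": rng.normal(size=n_utts),
            "jitter": rng.normal(size=n_utts),
        }

        for name, target in proxies.items():
            r2 = cross_val_score(Ridge(alpha=1.0), embeddings, target,
                                 scoring="r2", cv=5).mean()
            print(f"{name}: linear-probe R^2 = {r2:.3f}")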

    On Feature Importance and Interpretability of Speaker Representations

    Full text link
    Unsupervised speech disentanglement aims at separating fast-varying from slowly varying components of a speech signal. In this contribution, we take a closer look at the embedding vector representing the slowly varying signal components, commonly named the speaker embedding vector. We ask which properties of a speaker's voice are captured and investigate, using the concept of Shapley values, to what extent individual embedding vector components are responsible for them. Our findings show that certain speaker-specific acoustic-phonetic properties can be fairly well predicted from the speaker embedding, while the more abstract voice quality features we investigated cannot.
    Comment: Presented at the ITG Conference on Speech Communication 2023
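
    A hedged Python sketch of such a Shapley-value analysis: train a regressor to predict one speaker-specific acoustic property from the embedding, then attribute predictions to individual components with the shap library (the data, the target property, and the model choice are illustrative assumptions, not the paper's setup):

        # Attribute a predicted acoustic property to individual
        # speaker-embedding components via Shapley values.
        import numpy as np
        import shap
        from sklearn.ensemble import GradientBoostingRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(400, 32))                       # placeholder speaker embeddings
        y = 0.8 * X[:, 3] + rng.normal(scale=0.1, size=400)  # synthetic acoustic property

        model = GradientBoostingRegressor().fit(X, y)
        shap_values = shap.TreeExplainer(model).shap_values(X)  # shape: (n_samples, emb_dim)

        # Mean |Shapley value| per component = its global responsibility
        # for the predicted property.
        importance = np.abs(shap_values).mean(axis=0)
        print("most responsible components:", np.argsort(importance)[::-1][:5])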

    Re-examining the quality dimensions of synthetic speech

    No full text
    Seebauer FM, Kuhlmann M, Haeb-Umbach R, Wagner P. Re-examining the quality dimensions of synthetic speech. In: 12th ISCA Speech Synthesis Workshop (SSW2023). ISCA; 2023: 34-40.

    Dimensions of quality for state of the art synthetic speech

    No full text
    Seebauer FM, Wagner P. Dimensions of quality for state of the art synthetic speech. In: Bruggeman A, Ludusan B, Universität Bielefeld, eds. Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum. Bielefeld; 2022.
    Synthetic speech has a long-standing tradition of being employed for experiments in phonetics and laboratory phonology. The choice of synthesis method and system is commonly made by the researcher(s) to fit the specific quality criteria and study design. The overall quality of a given system, however, remains a confound that is difficult to control for [1]. In speech technology, newly proposed systems are usually compared across specific dimensions, e.g., ‘Intelligibility’ and ‘Naturalness’. These dimensions have already been extensively studied and evaluated within the context of old diphone and formant synthesis systems [2]. We contend, however, that these traditional dimensions need to be re-examined in the context of state-of-the-art Text-to-Speech (TTS) systems, as those newer models exhibit different quality deteriorations. Our work aims to bridge the conflicting demands for quality criteria that are easily computed and applied during TTS development, while at the same time remaining descriptive and meaningful for phonetic research. As a first step in this endeavor, we carried out an experiment to find suitable dimensions of TTS quality with a bottom-up approach based on descriptions provided by 11 participants (phonetic experts). The participants were instructed to label speech samples generated by 8 different state-of-the-art Text-to-Speech systems (varieties of English). Each system produced a stimulus consisting of two sentences of the phonetically balanced ‘caterpillar story’ [3]. In order to ensure that all systems were evaluated across different phonetic contexts in a balanced way, the sentences were rotated between participants so that each participant heard the complete story but with different parts read by different systems. The experimental setup is loosely based on the work in [4]. The participants were instructed to write down nouns, adjectives or sentences describing the quality of a given stimulus. Using embeddings generated by a pretrained BERT model [5] for semantic distances, we determined which of the participants’ terms were semantically similar. A subsequent affinity propagation clustering revealed 39 meaningfully different clusters, each representing a dimension of quality for synthetic voices. Keeping in mind that these dimensions are later to be used for ratings in actual evaluation experiments, it was decided to reduce the number of clusters to a more practical number of 10 and re-calculate the spectral clustering with a precomputed cosine affinity matrix. The resulting clusters and their respective quality descriptions are depicted in fig. 1. A manual analysis of the resulting dimensions led to the following descriptive labels: ‘artificiality/voice quality’, ‘intonation/noise/prosody’, ‘voice/audio quality’, ‘audio cuts’, ‘style/recording quality’, ‘emotion/voice quality/attitude’, ‘engagedness’, ‘human likeness’, ‘hyperarticulation’. From the assigned cluster descriptions it is evident that the semantic embeddings sometimes conflated several seemingly unrelated quality features into single dimensions (e.g., prosody and background noise), while occasionally splitting almost synonymous terms into multiple clusters (e.g., ‘artificiality’, ‘roboticness’ and ‘metallicness’).
    To evaluate these shortcomings of the semantic model, two independent manual clusterings were carried out. They were both limited to 10 clusters and reached a modified Jaccard agreement index of 63.44 with each other, while agreeing with the automatically computed clusters at 54.48 and 57.93, respectively. The low interrater agreement between the manual clusterings suggests that a panel decision process might be needed to determine the final quality dimensions. Subsequent research will evaluate clusters created by naïve listeners and quality dimensions of different sub-tasks in synthetic speech, such as voice conversion.
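
    The two-step clustering pipeline lends itself to a short sketch; the following Python code mirrors it with scikit-learn, using a sentence-transformers encoder as a stand-in for the pretrained BERT model (the encoder choice, the toy term list, and the reduced cluster count of 2 are assumptions for illustration; the paper found 39 clusters and reduced them to 10):

        import numpy as np
        from sentence_transformers import SentenceTransformer
        from sklearn.cluster import AffinityPropagation, SpectralClustering
        from sklearn.metrics.pairwise import cosine_similarity

        terms = ["robotic", "metallic", "natural prosody",
                 "background noise", "monotone", "choppy audio"]

        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for BERT
        E = encoder.encode(terms)

        # Step 1: affinity propagation chooses the cluster count itself.
        ap_labels = AffinityPropagation(random_state=0).fit_predict(E)

        # Step 2: re-cluster into a fixed, practical number of dimensions
        # using spectral clustering on a precomputed cosine affinity matrix.
        A = np.clip(cosine_similarity(E), 0.0, None)  # affinities must be non-negative
        sc_labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                       random_state=0).fit_predict(A)
        print(ap_labels, sc_labels)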

    Investigation into Target Speaking Rate Adaptation for Voice Conversion

    No full text
    Kuhlmann M, Seebauer FM, Ebbers J, Wagner P, Haeb-Umbach R. Investigation into Target Speaking Rate Adaptation for Voice Conversion. In: Proceedings of Interspeech. 2022: 4930-4934.
    Disentangling speaker and content attributes of a speech signal into separate latent representations, followed by decoding the content with an exchanged speaker representation, is a popular approach for voice conversion, which can be trained with non-parallel and unlabeled speech data. However, previous approaches perform disentanglement only implicitly, via some sort of information bottleneck or normalization, where it is usually hard to find a good trade-off between voice conversion and content reconstruction. Further, previous works usually do not consider adapting the speaking rate to the target speaker, or they place major restrictions on the data or use case. The contribution of this work is therefore two-fold. First, we employ an explicit and fully unsupervised disentanglement approach, which has previously only been used for representation learning, and show that it yields both superior voice conversion and superior content reconstruction. Second, we investigate simple and generic approaches to linearly adapt the length of a speech signal, and hence the speaking rate, to a target speaker, and show that the proposed adaptation increases the speaking rate similarity with respect to the target speaker.
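
    Linear length adaptation reduces to resampling the time axis by the ratio of source to target speaking rate. A minimal Python sketch (the per-speaker rate estimates and the placement in the VC pipeline are assumptions; the paper evaluates several variants):

        import numpy as np

        def linear_length_adapt(frames: np.ndarray, src_rate: float,
                                tgt_rate: float) -> np.ndarray:
            """Linearly resample a (T, D) frame sequence so its duration
            matches the target speaker's speaking rate."""
            T, D = frames.shape
            T_new = max(1, round(T * src_rate / tgt_rate))  # faster target -> fewer frames
            old_t = np.linspace(0.0, 1.0, T)
            new_t = np.linspace(0.0, 1.0, T_new)
            return np.stack([np.interp(new_t, old_t, frames[:, d])
                             for d in range(D)], axis=1)

        content = np.random.randn(200, 80)  # placeholder content representation
        adapted = linear_length_adapt(content, src_rate=4.2, tgt_rate=5.0)
        print(content.shape, "->", adapted.shape)  # (200, 80) -> (168, 80)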

    Discerning dimensions of quality for state of the art synthetic speech

    No full text
    Seebauer FM, Kuhlmann M, Haeb-Umbach R, Wagner P. Discerning dimensions of quality for state of the art synthetic speech. In: Skarnitzl R, Volín J, eds. Proceedings of the 20th International Congress of Phonetic Sciences. Prague; 2023: 3106-3110.
    This paper describes an approach for determining the dimensions of quality for state-of-the-art synthetic speech. We propose that current evaluation metrics do not fully capture the meaningful dimensions of text-to-speech (TTS) and voice conversion (VC) systems. In order to develop a revised paradigm for meaningful evaluation, we conducted two experiments. First, we determined descriptive terms by querying naïve listeners on their impressions of modern TTS and VC systems. In a second experiment, we refined these terms into dimensions of quality and similarity by showcasing a consolidation procedure of manual clusterings. The resulting dimensions contain the standard evaluation categories of “intelligibility” and “naturalness” for both conditions. We could additionally discern dimensions of “tempo” and “demographics” in both domains. The final two dimensions, as well as the relationships between categories, proved to be different between TTS and VC, suggesting the need for modified evaluation scales based on the target construct.

    Speech Disentanglement for Analysis and Modification of Acoustic and Perceptual Speaker Characteristics

    No full text
    Rautenberg F, Kuhlmann M, Ebbers J, et al. Speech Disentanglement for Analysis and Modification of Acoustic and Perceptual Speaker Characteristics. In: Deutsche Gesellschaft für Akustik e.V. (DEGA), ed. Fortschritte der Akustik - DAGA 2023. Tagungsband. Berlin; 2023: 1409-1412.
    Popular speech disentanglement systems decompose a speech signal into a content and a speaker embedding, from which a decoder reconstructs the input signal. Often it is unknown which information is encoded in the speaker embeddings. In this work, such a system is investigated on German speech data. We show that directions in the speaker embedding space correlate with different acoustic signal properties that are known to be characteristic of a speaker, and that, by manipulating the embeddings along these directions, the decoder synthesises a speech signal with modified acoustic properties.
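
    One way to realize the described manipulation, sketched in Python under the assumption that a property direction is obtained from a simple linear fit (the paper's exact procedure for finding directions is not reproduced here):

        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        E = rng.normal(size=(300, 64))  # placeholder speaker embeddings
        f0 = E @ rng.normal(size=64) + rng.normal(scale=0.1, size=300)  # synthetic acoustic property

        # The weight vector of a linear fit points along the direction in
        # embedding space that correlates with the acoustic property.
        direction = LinearRegression().fit(E, f0).coef_
        direction /= np.linalg.norm(direction)

        alpha = 2.0                              # step size along the property axis
        emb_modified = E[0] + alpha * direction  # decode this to synthesize modified speech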