
    Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion

    Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios has become increasingly popular. Although many works in the field share a common global pipeline, there is considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Obtaining a comprehensive understanding of why particular methods are chosen at each stage of the voice conversion pipeline can therefore be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 621 publications from more than 38 different venues between the years 2017 and 2023, followed by an in-depth review of a final database of 123 eligible studies. Based on the review, we summarise the most frequently used deep learning approaches to voice conversion and highlight common pitfalls within the community. Lastly, we condense the knowledge gathered, identify the main challenges and provide recommendations for future research directions.
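
    The "common global pipeline" mentioned in this review usually amounts to an analysis-synthesis chain in which linguistic content and speaker identity are modelled separately and then recombined. Below is a minimal PyTorch sketch of such a pipeline; it is not taken from the paper, and the module names, layer sizes, and the choice of GRU encoder/decoder are illustrative assumptions (a neural vocoder would normally convert the predicted mel-spectrogram back into a waveform).

    # A minimal sketch of a generic VC pipeline: content encoder + speaker embedding + decoder.
    # All sizes and module choices are assumptions for illustration, not the review's method.
    import torch
    import torch.nn as nn

    class ContentEncoder(nn.Module):
        def __init__(self, n_mels=80, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)

        def forward(self, mel):                  # mel: (batch, frames, n_mels)
            content, _ = self.rnn(mel)           # (batch, frames, 2 * hidden)
            return content

    class Decoder(nn.Module):
        def __init__(self, content_dim=512, spk_dim=128, n_mels=80, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(content_dim + spk_dim, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, n_mels)

        def forward(self, content, spk_emb):
            # Broadcast the utterance-level speaker embedding over every frame.
            spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
            out, _ = self.rnn(torch.cat([content, spk], dim=-1))
            return self.proj(out)                # converted mel-spectrogram

    # Conversion: encode the source utterance, decode with the *target* speaker's embedding.
    encoder, decoder = ContentEncoder(), Decoder()
    source_mel = torch.randn(1, 120, 80)         # dummy source utterance (mel frames)
    target_spk_emb = torch.randn(1, 128)         # dummy target speaker embedding
    converted_mel = decoder(encoder(source_mel), target_spk_emb)
    print(converted_mel.shape)                   # torch.Size([1, 120, 80])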

    Analysis, Disentanglement, and Conversion of Singing Voice Attributes

    Voice conversion is a prominent area of research, which can typically be described as the replacement of acoustic cues that relate to the perceived identity of the voice. Over almost a decade, deep learning has emerged as a transformative solution for this multifaceted task, offering various advancements to address different conditions and challenges in the field. One intriguing avenue for researchers in the field of Music Information Retrieval is singing voice conversion - a task that has only been subjected to neural network analysis and synthesis techniques over the last four years. The conversion of various singing voice attributes introduces new considerations, including working with limited datasets, adhering to musical context restrictions and considering how expression in singing is manifested in such attributes. Voice conversion with respect to singing techniques, for example, has received little attention even though its impact on the music industry would be considerable. This thesis therefore delves into problems related to vocal perception, limited datasets, and attribute disentanglement in the pursuit of optimal performance for the conversion of attributes that are scarcely labelled, which are covered across three research chapters. The first of these chapters describes the collection of perceptual pairwise dissimilarity ratings for singing techniques from participants. These ratings were subsequently subjected to clustering algorithms and compared against existing ground truth labels. The results confirm the viability of using existing singing technique-labelled datasets for singing technique conversion (STC) using supervised machine learning strategies. A dataset of dissimilarity ratings and timbral maps was generated, illustrating how register and gender conditions affect perception. In response to these findings, an adapted version of an existing voice conversion system was developed in conjunction with an existing labelled dataset. This served as the first implementation of a model for zero-shot STC, although it exhibited varying levels of success. An alternative method of attribute conversion was therefore considered as a means towards performing satisfactorily realistic STC. By refining ‘voice identity’ conversion for singing, future research can be conducted in which this attribute, along with more deterministic attributes (such as pitch, loudness, and phonetics), can be disentangled from an input signal, exposing information related to unlabelled attributes. Final experiments in refining the task of voice identity conversion for the singing domain were conducted as a stepping stone towards unlabelled attribute conversion. By performing comparative analyses between different features, singing and speech domains, and alternative loss functions, the most suitable process for singing voice attribute conversion (SVAC) could be established.
    In summary, this thesis documents a series of experiments that explore different aspects of the singing voice and conversion techniques in the pursuit of devising a convincing SVAC system.
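
    As a rough illustration of the perceptual analysis in the first research chapter, the sketch below clusters a toy pairwise-dissimilarity matrix and scores the resulting clusters against ground-truth technique labels. The synthetic data, the choice of average-linkage agglomerative clustering, and the adjusted Rand index are assumptions made for illustration only, not the thesis's actual procedure.

    # Cluster a perceptual pairwise-dissimilarity matrix and compare against ground-truth
    # singing-technique labels. Toy data; the real ratings would come from participants.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)

    n = 12                                        # number of singing-technique stimuli
    d = rng.random((n, n))
    dissim = (d + d.T) / 2                        # symmetric dissimilarity matrix
    np.fill_diagonal(dissim, 0.0)
    true_labels = np.repeat([0, 1, 2], n // 3)    # three hypothetical technique labels

    # Average-linkage agglomerative clustering on the condensed distance vector,
    # cut into three clusters.
    condensed = squareform(dissim, checks=False)
    pred_labels = fcluster(linkage(condensed, method="average"), t=3, criterion="maxclust")

    # Agreement between the perceptual clusters and the ground-truth labels.
    print("Adjusted Rand index:", adjusted_rand_score(true_labels, pred_labels))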

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS with a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. In addition, we extend this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, allowing us to generate dysarthric speech over a broader range of characteristics. To evaluate the effectiveness of these methods for synthesizing training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems.
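
    As a hedged sketch of the kind of conditioning described above, the code below projects a continuous severity value (loosely in the spirit of the RLT parameter) and a speaker embedding into the encoder space of a multi-speaker TTS front end. The module names, dimensions, and conditioning scheme are assumptions for illustration, not the dissertation's implementation.

    # Severity-conditioned TTS encoder sketch: a continuous severity scalar and a speaker
    # embedding are projected and added to every phoneme-encoder frame.
    import torch
    import torch.nn as nn

    class SeverityConditionedEncoder(nn.Module):
        def __init__(self, n_phones=60, dim=256):
            super().__init__()
            self.phone_emb = nn.Embedding(n_phones, dim)
            self.severity_proj = nn.Linear(1, dim)   # continuous severity coefficient
            self.spk_proj = nn.Linear(128, dim)      # pretrained speaker embedding
            self.rnn = nn.GRU(dim, dim, batch_first=True)

        def forward(self, phones, severity, spk_emb):
            x = self.phone_emb(phones)                            # (batch, T, dim)
            cond = self.severity_proj(severity) + self.spk_proj(spk_emb)
            x = x + cond.unsqueeze(1)                             # broadcast over time
            out, _ = self.rnn(x)
            return out                                            # passed on to the TTS decoder

    enc = SeverityConditionedEncoder()
    phones = torch.randint(0, 60, (2, 30))           # dummy phoneme ids
    severity = torch.tensor([[0.2], [0.9]])          # mild vs. severe, on a 0-1 scale
    spk_emb = torch.randn(2, 128)
    print(enc(phones, severity, spk_emb).shape)      # torch.Size([2, 30, 256])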

    Learning disentangled speech representations

    A variety of informational factors are contained within the speech signal and a single short recording of speech reveals much more than the spoken words. The best method to extract and represent informational factors from the speech signal ultimately depends on which informational factors are desired and how they will be used. In addition, some methods capture more than one informational factor at the same time, such as speaker identity, spoken content, and speaker prosody. The goal of this dissertation is to explore different ways to deconstruct the speech signal into abstract representations that can be learned and later reused in various speech technology tasks. This task of deconstruction, also known as disentanglement, is a form of distributed representation learning. As a general approach to disentanglement, there are some guiding principles that elaborate what a learned representation should contain as well as how it should function. In particular, learned representations should contain all of the requisite information in a more compact manner, be interpretable, remove nuisance factors of irrelevant information, be useful in downstream tasks, and be independent of the task at hand. The learned representations should also be able to answer counter-factual questions. In some cases, learned speech representations can be re-assembled in different ways according to the requirements of downstream applications. For example, in a voice conversion task, the speech content is retained while the speaker identity is changed; in a content-privacy task, some targeted content may be concealed without affecting how surrounding words sound. While there is no single best method to disentangle all types of factors, some end-to-end approaches demonstrate a promising degree of generalization to diverse speech tasks. This thesis explores a variety of use-cases for disentangled representations, including phone recognition, speaker diarization, linguistic code-switching, voice conversion, and content-based privacy masking. Speech representations can also be utilised for automatically assessing the quality and authenticity of speech, such as automatic MOS ratings or detecting deepfakes. The meaning of the term "disentanglement" is not well defined in previous work, and it has acquired several meanings depending on the domain (e.g. image vs. speech). Sometimes the term "disentanglement" is used interchangeably with the term "factorization". This thesis proposes that disentanglement of speech is distinct, and offers a viewpoint of disentanglement that can be considered both theoretically and practically.
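
    One practical way to check that a nuisance factor has been removed from a learned representation, in line with the guiding principles above, is to probe it with a simple classifier. The sketch below is an illustrative assumption rather than the thesis's evaluation: it trains a logistic-regression probe to predict speaker identity from dummy "content" representations, where near-chance accuracy would suggest that speaker information has been disentangled away.

    # Probe a learned "content" representation for residual speaker information.
    # Dummy features; with genuinely disentangled features, accuracy should stay near chance.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    reps = rng.normal(size=(400, 64))                # utterance-level content representations
    speakers = rng.integers(0, 4, size=400)          # which of four speakers produced each one

    X_tr, X_te, y_tr, y_te = train_test_split(reps, speakers, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("Speaker-probe accuracy:", probe.score(X_te, y_te))   # chance is 0.25 for 4 speakers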