
    Normal-to-Lombard Adaptation of Speech Synthesis Using Long Short-Term Memory Recurrent Neural Networks

    In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context. The evaluated vocoders are: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In the objective evaluations and the speaking style similarity tests, the FT method outperformed the other two adaptation methods. In the speech intelligibility tests, there were no significant differences between vocoders, although the PML vocoder performed slightly better than the other three.
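    As a concrete illustration of one of these methods, the sketch below shows how LHUC adaptation can be applied to an LSTM acoustic model: a pretrained network is frozen and only per-hidden-unit scaling parameters are learned on the Lombard data. This is a minimal PyTorch sketch; the class name, layer sizes, feature dimensions, and training loop are illustrative assumptions, not details taken from the article.

        import torch
        import torch.nn as nn

        class LHUCAcousticModel(nn.Module):
            # Hypothetical LSTM acoustic model with an LHUC layer; the sizes
            # are placeholders, not the article's configuration.
            def __init__(self, in_dim=600, hidden_dim=512, out_dim=187):
                super().__init__()
                self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers=2,
                                    batch_first=True)
                # One scaling parameter per hidden unit; 2*sigmoid keeps the
                # learned amplitude in (0, 2), following the LHUC formulation.
                self.lhuc = nn.Parameter(torch.zeros(hidden_dim))
                self.out = nn.Linear(hidden_dim, out_dim)

            def forward(self, x):                        # x: (batch, time, in_dim)
                h, _ = self.lstm(x)                      # (batch, time, hidden_dim)
                h = h * 2.0 * torch.sigmoid(self.lhuc)   # re-scale hidden units
                return self.out(h)

        # Adaptation: freeze the pretrained weights and learn only the LHUC
        # scales on the (typically small) Lombard-speech data.
        model = LHUCAcousticModel()
        for name, p in model.named_parameters():
            p.requires_grad = (name == "lhuc")
        optimiser = torch.optim.Adam([model.lhuc], lr=1e-3)
        loss_fn = nn.MSELoss()
        # for linguistic_feats, acoustic_feats in lombard_loader:  # assumed loader
        #     loss = loss_fn(model(linguistic_feats), acoustic_feats)
        #     optimiser.zero_grad(); loss.backward(); optimiser.step()

    FT, by contrast, would simply leave all parameters trainable, while LHUC constrains adaptation to a few hundred scaling values, which helps when the adaptation data is scarce.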

    Voice banking for individuals living with MND : a service review

    BACKGROUND: Voice banking allows those living with Motor Neurone Disease (MND) to create a personalised synthetic voice. Little is known about how best to support this process. OBJECTIVE: To review a dedicated voice banking service with the aim of informing service development. METHOD: A retrospective service review of existing health records from neurological services in Sheffield, UK, covering 2018 and 2019. Case notes were reviewed to extract information about use of communication aids, offers of voice banking, and use of synthesised speech. Responses to a routine follow-up survey were also collated. RESULTS: Fewer than half of the clients whose notes were reviewed had been informed about voice banking, one in four had completed the voice banking process, around half were using communication aids, and one in ten were using their personalised synthetic voice on a communication aid. The time taken to complete the process varied widely. Those who completed the process viewed the personalised voices positively, and every voice created was put into use. Several clients noted support from professionals as key. CONCLUSIONS: Voice banking services should be more widely promoted so that individuals can consider voice banking before changes to their speech occur. Research studies should inform how and when those living with MND are introduced to voice banking.

    Integrated speaker-adaptive speech synthesis

    Enabling speech synthesis systems to rapidly adapt to sound like a particular speaker is an essential attribute for building personalised systems. This is difficult for deep-learning based approaches, as these networks use a highly distributed representation; the model parameters are hard to interpret, which complicates adaptation. To address this problem, speaker characteristics can be encapsulated in fixed-length speaker-specific Identity Vectors (iVectors), which are appended to the input of the synthesis network. Altering the iVector changes the nature of the synthesised speech. The challenge is to derive, for each speaker, an optimal iVector that encodes all the speaker attributes required by the synthesis system. The standard approach involves two separate stages: estimating the iVectors for the training data, and then training the synthesis network. This paper proposes an integrated training scheme for speaker-adaptive speech synthesis. For the iVector extraction, an attention-based mechanism, which is a function of the context labels, is used to combine the data from the target speaker. This attention mechanism, as well as the nature of the features being merged, is optimised at the same time as the synthesis network parameters. This should yield an iVector-like speaker representation that is optimal for use with the synthesis system. The system is evaluated on the Voice Bank corpus. The resulting system automatically produces a sensible attention sequence and shows improved performance over the standard approach.
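    The sketch below illustrates the kind of attention-based pooling the abstract describes: a variable-length sequence of frames from the target speaker is collapsed into a fixed-length iVector-like embedding, which is concatenated with the context labels at the synthesis network's input, and everything is trained jointly. This is a minimal PyTorch sketch under assumed dimensions; for simplicity the attention scores here are computed from the reference frames themselves, whereas the paper conditions the attention on the context labels.

        import torch
        import torch.nn as nn

        class AttentivePooling(nn.Module):
            # Collapses a variable-length frame sequence into a fixed-length
            # speaker embedding via learned attention weights.
            def __init__(self, feat_dim=80, embed_dim=64):
                super().__init__()
                self.score = nn.Linear(feat_dim, 1)      # per-frame relevance
                self.proj = nn.Linear(feat_dim, embed_dim)

            def forward(self, frames):                   # (batch, time, feat_dim)
                w = torch.softmax(self.score(frames), dim=1)  # attend over time
                return (w * self.proj(frames)).sum(dim=1)     # (batch, embed_dim)

        class SpeakerAdaptiveSynthesis(nn.Module):
            # The pooled embedding is appended to every frame of the context
            # labels; pooling and synthesis parameters share one optimiser,
            # so the speaker representation is learned end-to-end.
            def __init__(self, label_dim=600, feat_dim=80, embed_dim=64,
                         out_dim=187):
                super().__init__()
                self.pool = AttentivePooling(feat_dim, embed_dim)
                self.net = nn.LSTM(label_dim + embed_dim, 512, batch_first=True)
                self.out = nn.Linear(512, out_dim)

            def forward(self, labels, ref_frames):
                spk = self.pool(ref_frames)              # (batch, embed_dim)
                spk = spk.unsqueeze(1).expand(-1, labels.size(1), -1)
                h, _ = self.net(torch.cat([labels, spk], dim=-1))
                return self.out(h)

    Because the gradient from the synthesis loss flows back through the pooling layer, the embedding is shaped by what the synthesis network actually needs, rather than being fixed in a separate iVector-extraction stage.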