691 research outputs found

    Unsupervised Cross-lingual Speaker Adaptation for HMM-based Speech Synthesis

    Get PDF
    In the EMIME project, we are developing a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrate two techniques, unsupervised adaptation for HMM-based TTS using a word-based large-vocabulary continuous speech recognizer and cross-lingual speaker adaptation for HMM-based TTS, into a single architecture. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Listening tests show very promising results, demonstrating that adapted voices sound similar to the target speaker and that differences between supervised and unsupervised cross-lingual speaker adaptation are small

    Data-Driven Enhancement of State Mapping-Based Cross-Lingual Speaker Adaptation

    Get PDF
    The thesis work was motivated by the goal of developing personalized speech-to-speech translation and focused on one of its key component techniques – cross-lingual speaker adaptation for text-to-speech synthesis. A personalized speech-to-speech translator enables a person’s spoken input to be translated into spoken output in another language while maintaining his/her voice identity. Before addressing any technical issues, work in this thesis set out to understand human perception of speaker identity. Listening tests were conducted in order to determine whether people could differentiate between speakers when they spoke different languages. The results demonstrated that differentiating between speakers across languages was an achievable task. However, it was difficult for listeners to differentiate between speakers across both languages and speech types (original recordings versus synthesized samples). The underlying challenge in cross-lingual speaker adaptation is how to apply speaker adaptation techniques when the language of adaptation data is different from that of synthesis models. The main body of the thesis work was devoted to the analysis and improvement of HMM state mapping-based cross-lingual speaker adaptation. Firstly, the effect of unsupervised cross-lingual adaptation was investigated, as it relates to the application scenario of personalized speech-to-speech translation. The comparison of paired supervised and unsupervised systems shows that the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised fashion, even if the average phoneme error rate of the unsupervised systems is around 75%. Then the effect of the language mismatch between synthesis models and adaptation data was investigated. The mismatch is found to transfer undesirable language information from adaptation data to synthesis models, thereby limiting the effectiveness of generating multiple regression class-specific transforms, using larger quantities of adaptation data and estimating adaptation transforms iteratively. Thirdly, in order to tackle the problems caused by the language mismatch, a data-driven adaptation framework using phonological knowledge is proposed. Its basic idea is to group HMM states according to phonological knowledge in a data-driven manner and then to map each state to a phonologically consistent counterpart in a different language. This framework is also applied to regression class tree construction for transform estimation. It is found that the proposed framework alleviates the negative effect of the language mismatch and gives consistent improvement compared to previous state-of-the-art approaches. Finally, a two-layer hierarchical transformation framework is developed, where one layer captures speaker characteristics and the other compensates for the language mismatch. The most appropriate means to construct the hierarchical arrangement of transforms was investigated in an initial study. While early results show some promise, further in-depth investigation is needed to confirm the validity of this hierarchy

    Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction

    Get PDF
    This paper demonstrates how unsupervised cross-lingual adaptation of HMM-based speech synthesis models may be performed without explicit knowledge of the adaptation data language. A two-pass decision tree construction technique is deployed for this purpose. Using parallel translated datasets, cross-lingual and intralingual adaptation are compared in a controlled manner. Listener evaluations reveal that the proposed method delivers performance approaching that of unsupervised intralingual adaptation

    Transfer Learning for Speech and Language Processing

    Full text link
    Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example in speech recognition, an acoustic model trained for one language can be used to recognize speech in another language, with little or no re-training data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual), and is traditionally studied in the name of `model adaptation'. Recent advance in deep learning shows that transfer learning becomes much easier and more effective with high-level abstract features learned by deep models, and the `transfer' can be conducted not only between data distributions and data types, but also between model structures (e.g., shallow nets and deep nets) or even model types (e.g., Bayesian models and neural models). This review paper summarizes some recent prominent research towards this direction, particularly for speech and language processing. We also report some results from our group and highlight the potential of this very interesting research field.Comment: 13 pages, APSIPA 201

    Personalising speech-to-speech translation:Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis

    Get PDF
    In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation in which we employ a HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions including the need for better evaluation metrics

    Cross-Lingual Speaker Discrimination Using Natural and Synthetic Speech

    Get PDF
    This paper describes speaker discrimination experiments in which native English listeners were presented with either natural speech stimuli in English and Mandarin, synthetic speech stimuli in English and Mandarin, or natural Mandarin speech and synthetic English speech stimuli. In each experiment, listeners were asked to decide whether they thought the sentences were spoken by the same person or not. We found that the results for Mandarin/English speaker discrimination are very similar to results found in previous work on German/English and Finnish/English speaker discrimination. We conclude from this and previous work that listeners are able to identify speakers across languages and they are able to identify speakers across speech types, but the combination of these two factors leads to a speaker discrimination task which is too difficult for listeners to perform successfully, given the quality of across-language speaker adapted speech synthesis at present. Index Terms: speaker discrimination, speaker adaptation, HMM-based speech synthesi

    Speech Synthesis Based on Hidden Markov Models

    Get PDF
    corecore