10,741 research outputs found

    Analysis of Speaker Clustering Strategies for HMM-Based Speech Synthesis

    This paper describes a method for speaker clustering, with the application of building average voice models for speaker-adaptive HMM-based speech synthesis that are a good basis for adapting to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model is better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to automatically predict these listener judgements of speaker similarity and thus identify similar speakers automatically. We then compare a variety of average voice models trained either on speakers who were perceptually judged to be similar to the target speaker, on speakers selected by the multiple linear regression, or on a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers performs better than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance. Index Terms: Statistical parametric speech synthesis, hidden Markov models, speaker adaptation
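    As a concrete illustration of the selection step, the sketch below fits a multiple linear regression to listener similarity scores and ranks candidate speakers for a target. It is a minimal sketch under assumed inputs: the feature set, speaker counts, and all data are placeholders, not the paper's actual setup.

```python
# Hypothetical sketch: predict perceptual speaker-similarity scores with
# multiple linear regression, then rank candidate speakers for average-voice
# training. Features and scores are random placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

n_pairs = 435       # e.g. all unordered pairs of 30 speakers: C(30, 2)
n_features = 8      # assumed acoustic distance features per speaker pair
X = rng.normal(size=(n_pairs, n_features))   # placeholder pairwise features
y = rng.uniform(0.0, 1.0, size=n_pairs)      # placeholder listener judgements

model = LinearRegression().fit(X, y)         # regress judgements on features

def rank_similar_speakers(pair_features, speaker_ids, k=10):
    """Return the k candidate speakers predicted most similar to the target."""
    scores = model.predict(pair_features)    # one row per (target, candidate) pair
    order = np.argsort(scores)[::-1]         # higher predicted score = more similar
    return [speaker_ids[i] for i in order[:k]]

# Toy usage: rank 29 candidates against one held-out target speaker.
candidates = rng.normal(size=(29, n_features))
print(rank_similar_speakers(candidates, [f"spk{i:02d}" for i in range(29)], k=5))
```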

    ACCDIST: A Metric for comparing speakers' accents

    This paper introduces a new metric for the quantitative assessment of the similarity of speakers' accents. The ACCDIST metric is based on the correlation of inter-segment distance tables across speakers or groups. Basing the metric on segment similarity within a speaker ensures that it is sensitive to the speaker's pronunciation system rather than to his or her voice characteristics. The metric is shown to have an error rate of only 11% on the classification of speakers into 14 English regional accents of the British Isles, half the error rate of a metric based directly on spectral information. The metric may also be useful for cluster analysis of accent groups.
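    The core of the metric can be sketched in a few lines: build each speaker's own inter-segment distance table, then correlate tables between speakers. Everything below (the segment representation and toy data) is an illustrative assumption, not the published implementation.

```python
# Rough sketch of the ACCDIST idea under simplifying assumptions: represent
# each speaker by the distances between their own vowel segments, then
# compare speakers by correlating those distance tables.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def distance_table(segment_means):
    """segment_means: (n_segments, n_dims) per-speaker segment centroids,
    e.g. mean formant vectors for each vowel in a fixed word list."""
    return pdist(segment_means)   # condensed inter-segment distance table

def accdist_similarity(speaker_a, speaker_b):
    """Correlation of two speakers' inter-segment distance tables; a high
    correlation suggests similar pronunciation systems, independent of
    absolute voice characteristics."""
    r, _ = pearsonr(distance_table(speaker_a), distance_table(speaker_b))
    return r

# Toy usage: two speakers, 5 vowel segments described by (F1, F2) means.
rng = np.random.default_rng(1)
a = rng.normal(size=(5, 2))
b = a + rng.normal(scale=0.1, size=(5, 2))   # a slightly perturbed copy
print(accdist_similarity(a, b))              # close to 1.0 for similar accents
```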

    Automatic classification of speaker characteristics


    Australian accent-based speaker classification


    Leveraging native language information for improved accented speech recognition

    Recognition of accented speech is a long-standing challenge for automatic speech recognition (ASR) systems, given the increasing worldwide population of bilingual speakers with English as their second language. If we consider foreign-accented speech as an interpolation of the native language (L1) and English (L2), a model that can simultaneously address both languages should perform better at the acoustic level for accented speech. In this study, we explore how an end-to-end recurrent neural network (RNN) system trained with English and native languages (Spanish and Indian languages) can leverage native-language data to improve performance on accented English speech. To this end, we examine pre-training with native languages, as well as multi-task learning (MTL) in which the main task is trained with native English and the secondary task is trained with Spanish or Indian languages. We show that the proposed MTL model performs better than the pre-training approach and outperforms a baseline model trained simply with English data. We suggest a new setting for MTL in which the secondary task is trained with both English and the native language, using the same output set. This proposed scenario yields better performance, with +11.95% and +17.55% character error rate gains over baseline for Hispanic and Indian accents, respectively.

    Comment: Accepted at Interspeech 2018
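    A minimal sketch of such a multi-task setup, assuming PyTorch and a CTC objective: a shared RNN encoder feeds a main English head and a secondary native-language head. All sizes, the loss weighting, and the dummy batch are illustrative assumptions, not the paper's configuration.

```python
# Illustrative multi-task ASR skeleton: shared encoder, two task heads.
import torch
import torch.nn as nn

class MTLAccentASR(nn.Module):
    def __init__(self, n_feats=80, hidden=256, n_symbols=30):
        super().__init__()
        # Shared acoustic encoder used by both tasks.
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.english_head = nn.Linear(2 * hidden, n_symbols)  # main task (L2)
        self.native_head = nn.Linear(2 * hidden, n_symbols)   # secondary task (L1)

    def forward(self, feats):
        enc, _ = self.encoder(feats)                 # (batch, time, 2*hidden)
        return self.english_head(enc), self.native_head(enc)

model = MTLAccentASR()
ctc = nn.CTCLoss(blank=0)

# Dummy batch: 4 utterances, 100 frames of 80-dim features, 20-symbol targets.
# Real training would use native-language transcripts for the secondary head;
# here one placeholder target set stands in for both (the proposed setting
# does share a single output symbol set across tasks).
feats = torch.randn(4, 100, 80)
targets = torch.randint(1, 30, (4, 20))
in_lens = torch.full((4,), 100, dtype=torch.long)
tgt_lens = torch.full((4,), 20, dtype=torch.long)

en_logits, l1_logits = model(feats)
# nn.CTCLoss expects (time, batch, classes) log-probabilities.
en_loss = ctc(en_logits.log_softmax(-1).transpose(0, 1), targets, in_lens, tgt_lens)
l1_loss = ctc(l1_logits.log_softmax(-1).transpose(0, 1), targets, in_lens, tgt_lens)
# Down-weight the secondary task so the shared encoder stays focused on English.
loss = en_loss + 0.3 * l1_loss
loss.backward()
```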

    Affect recognition from speech


    Acoustic Analysis of Nigerian English Vowels Based on Accents

    Accent has been widely acknowledged as a major source of automatic speech recognition (ASR) performance degradation. Most ASR applications were developed with speech samples from native English speakers, despite the fact that the majority of their potential users speak English as a second language with a marked accent. Nigeria, like most nations colonized by Britain, uses English as its official language despite being a multi-ethnic nation. This work explores the acoustic features of energy, fundamental frequency, and the first three formants across the three major ethnic groups of Nigeria, based on features extracted from five pure English vowels produced by Nigerian subjects. The research aims to determine what differences, if any, exist between the pronunciations of the three major ethnic nationalities in Nigeria, to aid the development of ASR that is robust to the Nigerian English (NE) accent. The results show that significant differences exist between the mean values of the pure English vowels as pronounced by the three major ethnic groups: Hausa, Ibo, and Yoruba. These differences can be exploited to enhance the performance of ASR in recognizing NE.
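    A hedged sketch of the kind of measurement pipeline such a study implies, assuming the praat-parselmouth package: RMS energy, mean F0, and the first three formants at a vowel's midpoint. The file name and timing choice are assumptions for illustration, not details from the paper.

```python
# Extract energy, F0, and F1-F3 from one recorded vowel via Praat bindings.
import numpy as np
import parselmouth

snd = parselmouth.Sound("vowel_a_speaker01.wav")   # hypothetical recording

# RMS energy over the whole segment.
energy = np.sqrt(np.mean(snd.values ** 2))

# F0 track via Praat's pitch analysis; drop unvoiced (zero-valued) frames.
pitch = snd.to_pitch()
f0 = pitch.selected_array['frequency']
mean_f0 = f0[f0 > 0].mean()

# First three formants (Burg method), sampled at the segment midpoint.
formants = snd.to_formant_burg()
t_mid = snd.duration / 2
f1, f2, f3 = (formants.get_value_at_time(i, t_mid) for i in (1, 2, 3))

print(f"energy={energy:.4f}, F0={mean_f0:.1f} Hz, "
      f"F1={f1:.0f}, F2={f2:.0f}, F3={f3:.0f} Hz")
```

    Per-group means of such measurements could then be compared with a standard significance test to check for accent-dependent differences.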