    Methods for speaking style conversion from normal speech to high vocal effort speech

    This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics on the conversion of normal speech to high vocal effort speech. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with vocal effort mismatch between test and enrollment utterances (shouted speech vs. normal speech). The mismatch degrades the system's speaker identification performance. As a solution, we proposed an SSC system that included a novel spectral mapping, used alongside a statistical mapping technique, to transform the mel-frequency spectral energies of normal speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing speaker identification rates for a state-of-the-art i-vector-based speaker recognition system, with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves the speaker identification rates. The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. This system first extracts speech features using the vocoder. Next, a mapping technique that is robust to data scarcity maps the features. Finally, the vocoder synthesizes speech from the mapped features. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoder cases with two subjective listening tests that measured similarity to Lombard speech and naturalness. The similarity test showed that, for both vocoder cases, our proposed SSC system was able to convert normal speech to Lombard speech. The naturalness test showed that the converted samples using the glottal vocoder were clearly more natural than those obtained with STRAIGHT.

    EMALG: An Enhanced Mandarin Lombard Grid Corpus with Meaningful Sentences

    This study investigates the Lombard effect, where individuals adapt their speech in noisy environments. We introduce an enhanced Mandarin Lombard grid (EMALG) corpus with meaningful sentences, which extends the Mandarin Lombard grid (MALG) corpus. EMALG features 34 speakers and improves the recording setup, addressing challenges that MALG faced with nonsense sentences. Our findings reveal that in Mandarin, female speakers exhibit a more pronounced Lombard effect than male speakers, particularly when uttering meaningful sentences. Additionally, we find that nonsense sentences negatively impact Lombard effect analysis. Moreover, our results reaffirm the consistency in the Lombard effect comparison between English and Mandarin found in previous research.

    Augmented CycleGANs for continuous scale normal-to-Lombard speaking style conversion

    Lombard speech is a speaking style associated with increased vocal effort that is naturally used by humans to improve intelligibility in the presence of noise. It is hence desirable to have a system capable of converting speech from normal to Lombard style. Moreover, it would be useful if one could adjust the degree of Lombardness in the converted speech so that the system is more adaptable to different noise environments. In this study, we propose the use of recently developed augmented cycle-consistent adversarial networks (Augmented CycleGANs) for conversion between normal and Lombard speaking styles. The proposed system gives smooth control over the degree of Lombardness of the mapped utterances by traversing different points in the latent space of the trained model. We utilize a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract features from normal speech, which are then mapped to Lombard-style features using the Augmented CycleGAN. Finally, the mapped features are converted to Lombard speech with PML. The model is trained on multi-language data recorded in different noise conditions, and we compare its effectiveness to a previously proposed CycleGAN system in experiments on the intelligibility and quality of the mapped speech.