
    Exploiting pitch dynamics for speech spectral estimation using a two-dimensional processing framework

    Thesis (S.M.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. By Tianyu Tom Wang. Includes bibliographical references (p. 133-135).
    This thesis addresses the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological modeling studies implicating the use of temporal changes in speech by humans. Specifically, we develop and evaluate signal processing schemes that exploit temporal change of pitch as a basis for high-pitch formant estimation. As part of our development, we assess the source-filter separation capabilities of several two-dimensional processing schemes that utilize both standard spectrographic and auditory-based time-frequency representations. Our methods show quantitative improvements under certain conditions over representations derived from traditional and homomorphic linear prediction. We conclude by highlighting potential benefits of our framework in the particular application of speaker recognition, with preliminary results indicating a closure of the gender performance gap on subsets of the TIMIT corpus.
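    As an illustration of the kind of two-dimensional source-filter separation the abstract describes, the sketch below applies a 2D Fourier transform to a localized spectrogram patch and keeps only low modulation frequencies, suppressing pitch harmonics while retaining the slowly varying formant envelope. This is a generic stand-in for the thesis's framework, not the author's actual scheme; the STFT settings, patch size, and cutoff are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import stft

def formant_envelope_2d(x, fs, patch_frames=32, nperseg=512, cutoff=0.15):
    """Pitch-suppressed log-spectral envelope via 2D low-pass filtering of a spectrogram patch."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=3 * nperseg // 4)
    log_spec = np.log(np.abs(Z) + 1e-10)      # frequency x time log-magnitude

    patch = log_spec[:, :patch_frames]        # one localized time-frequency patch

    # 2D DFT of the patch: pitch-harmonic ripple maps to high modulation
    # frequencies along the frequency axis, while slowly varying formant
    # structure concentrates near the origin.
    P = np.fft.fft2(patch)
    fy = np.fft.fftfreq(patch.shape[0])       # cycles per frequency bin
    ft = np.fft.fftfreq(patch.shape[1])       # cycles per frame
    keep = (np.abs(fy)[:, None] < cutoff) & (np.abs(ft)[None, :] < cutoff)

    # Low-pass in the 2D modulation domain and invert: peaks of the result
    # approximate formant locations with harmonic ripple attenuated.
    return np.real(np.fft.ifft2(P * keep))
```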

    The role of gesture delay in coda /r/ weakening: an articulatory, auditory and acoustic study

    The cross-linguistic tendency of coda consonants to weaken, vocalize, or be deleted is shown to have a phonetic basis, resulting from gesture reduction, or variation in gesture timing. This study investigates the effects of the timing of the anterior tongue gesture for coda /r/ on acoustics and perceived strength of rhoticity, making use of two sociolects of Central Scotland (working- and middle-class) where coda /r/ is weakening and strengthening, respectively. Previous articulatory analysis revealed a strong tendency for these sociolects to use different coda /r/ tongue configurations—working- and middle-class speakers tend to use tip/front raised and bunched variants, respectively; however, this finding does not explain working-class /r/ weakening. A correlational analysis in the current study showed a robust relationship between anterior lingual gesture timing, F3, and percept of rhoticity. A linear mixed effects regression analysis showed that both speaker social class and linguistic factors (word structure and the checked/unchecked status of the prerhotic vowel) had significant effects on tongue gesture timing and formant values. This study provides further evidence that gesture delay can be a phonetic mechanism for coda rhotic weakening and apparent loss, but social class emerges as the dominant factor driving lingual gesture timing variation.
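    A minimal sketch of the type of linear mixed-effects model the abstract reports, written with statsmodels. The data file and column names (gesture_delay, social_class, word_structure, vowel_checked, speaker) and the random-effects structure are hypothetical placeholders, not the study's actual coding.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-token measurements: one row per coda /r/ production.
df = pd.read_csv("coda_r_measurements.csv")

# Fixed effects: speaker social class and the linguistic factors named in the
# abstract; random intercept per speaker to account for repeated measures.
model = smf.mixedlm(
    "gesture_delay ~ C(social_class) + C(word_structure) + C(vowel_checked)",
    data=df,
    groups=df["speaker"],
)
result = model.fit()
print(result.summary())

# An analogous model with F3 as the response would test the reported effects
# on formant values.
```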

    Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods

    In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using the formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis are used. In the proposed refinement approach, the contours of the three lowest formants are first predicted by the data-driven DeepFormants tracker, and the predicted formants are then replaced frame-wise with local spectral peaks given by the model-driven LP-based methods. The refinement procedure can be plugged into the DeepFormants tracker with no need for any new training. Two refined DeepFormants trackers were compared with the original DeepFormants and with five known traditional trackers using the popular vocal tract resonance (VTR) corpus. The results indicated that the data-driven DeepFormants trackers outperformed the conventional trackers and that the best performance was obtained by refining the formants predicted by DeepFormants using QCP-FB analysis. In addition, by tracking formants in VTR speech corrupted by additive noise, the study showed that the refined DeepFormants trackers were more resilient to noise than the reference trackers. In general, these results suggest that LP-based model-driven approaches, which have traditionally been used in formant estimation, can be combined with a modern data-driven tracker easily, with no further training, to improve the tracker's performance.
    Comment: Computer Speech and Language, Vol. 81, Article 101515, June 2023
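    The frame-wise refinement step described above (each DeepFormants-predicted formant is snapped to the nearest local peak of an LP spectral envelope) can be sketched as follows. Conventional autocorrelation LP via librosa.lpc stands in here for the paper's LP-COV and QCP-FB analyses, and all names and parameter values are illustrative assumptions rather than the authors' code.

```python
import numpy as np
import librosa
from scipy.signal import freqz, find_peaks

def refine_frame(frame, fs, predicted_formants, lp_order=12):
    """Replace predicted formant frequencies (Hz) with the nearest LP spectral peaks."""
    a = librosa.lpc(frame.astype(float), order=lp_order)   # all-pole LP coefficients
    w, h = freqz(1.0, a, worN=1024, fs=fs)                  # LP envelope over 0..fs/2
    peaks, _ = find_peaks(20 * np.log10(np.abs(h) + 1e-10)) # local spectral peaks
    peak_freqs = w[peaks]
    if len(peak_freqs) == 0:
        return np.asarray(predicted_formants)                # keep the prediction as-is
    # Snap each predicted formant (e.g. F1-F3 from DeepFormants) to its
    # closest LP peak frequency.
    return np.array([peak_freqs[np.argmin(np.abs(peak_freqs - f))]
                     for f in predicted_formants])
```

    In a full tracker this function would be applied per analysis frame to the three lowest formant contours predicted by DeepFormants, without retraining the network, mirroring the plug-in refinement the abstract describes.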