693 research outputs found
Exploiting pitch dynamics for speech spectral estimation using a two-dimensional processing framework
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 133-135).

This thesis addresses the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological modeling studies implicating the use of temporal changes in speech by humans. Specifically, we develop and evaluate signal processing schemes that exploit temporal change of pitch as a basis for high-pitch formant estimation. As part of our development, we assess the source-filter separation capabilities of several two-dimensional processing schemes that utilize both standard spectrographic and auditory-based time-frequency representations. Our methods show quantitative improvements under certain conditions over representations derived from traditional and homomorphic linear prediction. We conclude by highlighting potential benefits of our framework in the particular application of speaker recognition, with preliminary results indicating a performance gender-gap closure on subsets of the TIMIT corpus.

by Tianyu Tom Wang. S.M.
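The conventional one-dimensional baseline that this thesis measures against, linear-prediction (LP) formant estimation, can be sketched as follows. All signal parameters here (sampling rate, formant frequencies, bandwidths, LP order) are illustrative assumptions for a synthetic two-formant vowel, not values from the thesis:

```python
import numpy as np

def lp_formants(x, order=4, fs=8000):
    """Estimate formant frequencies via autocorrelation-method linear
    prediction: solve the Toeplitz normal equations, then take the
    resonance angles of the LP polynomial roots (lowest first)."""
    x = x - np.mean(x)
    x = x * np.hamming(len(x))                       # analysis window
    r = np.correlate(x, x, mode="full")[len(x) - 1:] # autocorrelation r[0..]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])           # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))    # zeros of A(z)
    # Keep one root per conjugate pair; reject heavily damped (broad) poles.
    roots = roots[(roots.imag > 0) & (np.abs(roots) > 0.9)]
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[(freqs > 90) & (freqs < fs / 2 - 90)]

# Synthetic vowel-like signal: a glottal impulse train passed through a
# cascade of two second-order resonators (F1 = 700 Hz, F2 = 1200 Hz).
fs, f0, bw = 8000, 100, 80.0
true_formants = [700.0, 1200.0]
x = np.zeros(4000)
x[:: fs // f0] = 1.0
for fc in true_formants:
    rp = np.exp(-np.pi * bw / fs)
    a1, a2 = 2 * rp * np.cos(2 * np.pi * fc / fs), -rp ** 2
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] + (a1 * y[n - 1] if n > 0 else 0) + (a2 * y[n - 2] if n > 1 else 0)
    x = y

est = lp_formants(x[500:1500])
```

At a low pitch (f0 = 100 Hz, as here) the harmonics sample the spectral envelope densely and LP recovers the formants well; at a high pitch the harmonics become sparse and these estimates bias toward the nearest harmonic, which is the failure mode the thesis's two-dimensional framework targets.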
Acoustic variability and automatic recognition of children’s speech
International audience
The role of gesture delay in coda /r/ weakening: an articulatory, auditory and acoustic study
The cross-linguistic tendency of coda consonants to weaken, vocalize, or be deleted is shown to have a phonetic basis, resulting from gesture reduction or variation in gesture timing. This study investigates the effects of the timing of the anterior tongue gesture for coda /r/ on acoustics and perceived strength of rhoticity, making use of two sociolects of Central Scotland (working- and middle-class) where coda /r/ is weakening and strengthening, respectively. Previous articulatory analysis revealed a strong tendency for these sociolects to use different coda /r/ tongue configurations: working- and middle-class speakers tend to use tip/front raised and bunched variants, respectively; however, this finding does not explain working-class /r/ weakening. A correlational analysis in the current study showed a robust relationship between anterior lingual gesture timing, F3, and percept of rhoticity. A linear mixed-effects regression analysis showed that both speaker social class and linguistic factors (word structure and the checked/unchecked status of the prerhotic vowel) had significant effects on tongue gesture timing and formant values. This study provides further evidence that gesture delay can be a phonetic mechanism for coda rhotic weakening and apparent loss, but social class emerges as the dominant factor driving lingual gesture timing variation.
A novel framework for high-quality voice source analysis and synthesis
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.

The analysis, parameterization and modeling of voice source estimates obtained via inverse filtering of recorded speech are some of the most challenging areas of speech processing, owing to the fact that humans produce a wide range of voice source realizations and that the voice source estimates commonly contain artifacts due to the non-linear, time-varying source-filter coupling. Currently, the most widely adopted representation of the voice source signal is the Liljencrants-Fant (LF) model, developed in late 1985. Due to its overly simplistic interpretation of voice source dynamics, the LF model can neither represent the fine temporal structure of glottal flow derivative realizations nor carry sufficient spectral richness to facilitate truly natural-sounding speech synthesis. In this thesis we introduce Characteristic Glottal Pulse Waveform Parameterization and Modeling (CGPWPM), an entirely novel framework for voice source analysis, parameterization and reconstruction. In a comparative evaluation of CGPWPM and the LF model, we demonstrate that the proposed method preserves higher levels of speaker-dependent information from the voice source estimates and realizes more natural-sounding speech synthesis. In general, we show that CGPWPM-based speech synthesis rates highly on the scale of absolute perceptual acceptability and that speech signals are faithfully reconstructed on a consistent basis, across speakers and genders. We apply CGPWPM to voice quality profiling and to a text-independent voice quality conversion method. The proposed voice conversion method achieves the desired perceptual effects, and the converted speech remains as natural-sounding and intelligible as natural speech. In this thesis, we also develop an optimal wavelet thresholding strategy for voice source signals, which suppresses aspiration noise while retaining both the slow and the rapid variations in the voice source estimate.
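The wavelet-thresholding idea mentioned in this abstract can be illustrated with a minimal numpy-only sketch, assuming a Haar wavelet and the universal soft threshold. The synthetic waveform and every parameter choice below are illustrative assumptions, not the thesis's optimal strategy:

```python
import numpy as np

def haar_forward(x, levels):
    """Multi-level Haar DWT: returns (approximation, details fine-to-coarse)."""
    details, a = [], x.astype(float)
    for _ in range(levels):
        d = (a[0::2] - a[1::2]) / np.sqrt(2)
        a = (a[0::2] + a[1::2]) / np.sqrt(2)
        details.append(d)
    return a, details

def haar_inverse(a, details):
    for d in reversed(details):            # coarsest level first
        out = np.empty(2 * len(a))
        out[0::2] = (a + d) / np.sqrt(2)
        out[1::2] = (a - d) / np.sqrt(2)
        a = out
    return a

def soft(d, lam):
    """Soft thresholding: shrink coefficients toward zero by lam."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def denoise(x, levels=3):
    a, details = haar_forward(x, levels)
    # Noise level from the finest-scale details (median absolute deviation),
    # then the universal threshold sigma * sqrt(2 log N).
    sigma = np.median(np.abs(details[0])) / 0.6745
    lam = sigma * np.sqrt(2 * np.log(len(x)))
    return haar_inverse(a, [soft(d, lam) for d in details])

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 11 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)   # "aspiration noise" stand-in
denoised = denoise(noisy)
```

A fixed global threshold like this one is the generic textbook scheme; the thesis's contribution is precisely in tuning the thresholding so that the rapid variations of real voice source estimates survive alongside the slow ones.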
Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods
In this study, formant tracking is investigated by refining the formants
tracked by an existing data-driven tracker, DeepFormants, using the formants
estimated in a model-driven manner by linear prediction (LP)-based methods. As
LP-based formant estimation methods, conventional covariance analysis (LP-COV)
and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis
are used. In the proposed refinement approach, the contours of the three lowest
formants are first predicted by the data-driven DeepFormants tracker, and the
predicted formants are replaced frame-wise with local spectral peaks shown by
the model-driven LP-based methods. The refinement procedure can be plugged into
the DeepFormants tracker with no need for any new data learning. Two refined
DeepFormants trackers were compared with the original DeepFormants and with
five known traditional trackers using the popular vocal tract resonance (VTR)
corpus. The results indicated that the data-driven DeepFormants trackers
outperformed the conventional trackers and that the best performance was
obtained by refining the formants predicted by DeepFormants using QCP-FB
analysis. In addition, by tracking formants using VTR speech that was corrupted
by additive noise, the study showed that the refined DeepFormants trackers were
more resilient to noise than the reference trackers. In general, these results
suggest that LP-based model-driven approaches, which have traditionally been
used in formant estimation, can be combined with a modern data-driven tracker
easily with no further training to improve the tracker's performance.

Comment: Computer Speech and Language, Vol. 81, Article 101515, June 202
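The frame-wise replacement step at the core of this refinement can be sketched in a few lines. The tolerance and the candidate-peak arrays below are illustrative assumptions; in the paper the candidate peaks come from LP-COV or QCP-FB analysis:

```python
import numpy as np

def refine_frame(predicted, lp_peaks, tol_hz=300.0):
    """Replace each tracker-predicted formant with the nearest LP spectral
    peak, keeping the prediction when no peak lies within tol_hz."""
    refined = []
    for f in predicted:
        if len(lp_peaks) == 0:
            refined.append(f)
            continue
        nearest = lp_peaks[np.argmin(np.abs(lp_peaks - f))]
        refined.append(nearest if abs(nearest - f) <= tol_hz else f)
    return np.array(refined)

# One hypothetical frame: the tracker predicts F1-F3, LP analysis shows peaks.
predicted = np.array([520.0, 1480.0, 2600.0])
lp_peaks = np.array([495.0, 1530.0, 3350.0])    # no candidate near F3
refined = refine_frame(predicted, lp_peaks)      # F1, F2 snapped; F3 kept
```

Because the step only post-processes the tracker's output against independently computed spectral peaks, it needs no retraining, which is the property the abstract emphasizes.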