
    Kalman tracking of linear predictor and harmonic noise models for noisy speech enhancement

    This paper presents a speech enhancement method based on the tracking and denoising of the formants of a linear prediction (LP) model of the spectral envelope of speech and of the parameters of a harmonic noise model (HNM) of its excitation. The main advantages of tracking and denoising the prominent energy contours of speech are the efficient use of the spectral and temporal structures of successive speech frames and the mitigation of the processing artefact known as ‘musical noise’ or ‘musical tones’.

    The formant-tracking linear prediction (FTLP) model estimation consists of three stages: (a) speech pre-cleaning based on spectral amplitude estimation, (b) formant tracking across successive speech frames using the Viterbi method, and (c) Kalman filtering of the formant trajectories across successive speech frames.

    The HNM parameters for the excitation signal comprise: the voiced/unvoiced decision, the fundamental frequency, the harmonics’ amplitudes, and the variance of the noise component of the excitation. A frequency-domain pitch extraction method is proposed that searches for the peak signal-to-noise ratios (SNRs) at the harmonics. For each speech frame, several pitch candidates are calculated, and an estimate of the pitch trajectory across successive frames is obtained using a Viterbi decoder. The trajectories of the noisy excitation harmonics across successive speech frames are modeled and denoised using Kalman filters.

    The proposed method is used to deconstruct noisy speech, denoise its model parameters, and then reconstitute speech from its cleaned parts. Experimental evaluations show the performance gains of the formant tracking, pitch extraction and noise reduction stages.
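
    The per-trajectory denoising described above reduces to a standard Kalman filtering problem. The following is a minimal sketch, not the paper's implementation: it assumes a constant-velocity state model for one formant track, and the noise settings q, r and the frame hop dt are illustrative assumptions.

        import numpy as np

        def kalman_denoise_track(track_hz, q=50.0, r=40.0**2, dt=0.01):
            """Smooth one noisy formant (or harmonic) trajectory with a
            constant-velocity Kalman filter. track_hz: per-frame estimates
            in Hz; q, r: assumed process/measurement noise; dt: hop in s."""
            F = np.array([[1.0, dt], [0.0, 1.0]])       # state transition
            H = np.array([[1.0, 0.0]])                  # only frequency is observed
            Q = q * np.array([[dt**3 / 3, dt**2 / 2],   # white-acceleration noise
                              [dt**2 / 2, dt]])
            R = np.array([[r]])
            x = np.array([track_hz[0], 0.0])            # state: [frequency, velocity]
            P = np.diag([r, 1e4])
            out = np.empty(len(track_hz))
            for t, z in enumerate(track_hz):
                x = F @ x                               # predict
                P = F @ P @ F.T + Q
                K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
                x = x + (K @ (z - H @ x)).ravel()       # correct with measurement
                P = (np.eye(2) - K @ H) @ P
                out[t] = x[0]
            return out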

    Bio-inspired broad-class phonetic labelling

    Recent studies have shown that the correct labeling of phonetic classes may help current Automatic Speech Recognition (ASR) systems when combined with classical parsing automata based on Hidden Markov Models (HMM). This paper describes a method for Phonetic Class Labeling (PCL) based on bio-inspired speech processing. The methodology is based on the automatic detection of formants and formant trajectories after a careful separation of the vocal and glottal components of speech, and on the operation of CF (Characteristic Frequency) neurons in the cochlear nucleus and cortical complex of the human auditory apparatus. Examples of phonetic class labeling are given, and the applicability of the method to speech processing is discussed.
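
    For illustration only, broad phonetic classes can be approximated from simple frame-level cues. The toy labeller below uses short-time energy and zero-crossing rate as a crude stand-in; it does not reproduce the paper's CF-neuron modelling or formant-trajectory analysis, and the thresholds are arbitrary assumptions.

        import numpy as np

        def broad_class_labels(x, fs, frame=0.025, hop=0.010):
            """Tag each frame as 'sil', 'voiced' or 'unvoiced' from
            short-time energy and zero-crossing rate (toy heuristic)."""
            n, h = int(frame * fs), int(hop * fs)
            labels = []
            for start in range(0, len(x) - n, h):
                seg = x[start:start + n]
                energy = np.mean(seg ** 2)
                zcr = np.mean(np.abs(np.diff(np.sign(seg)))) / 2.0
                if energy < 1e-5:                 # near-silent frame
                    labels.append("sil")
                elif zcr > 0.25:                  # noise-like excitation
                    labels.append("unvoiced")
                else:                             # periodic, formant-bearing
                    labels.append("voiced")
            return labels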

    Speech Sensorimotor Learning through a Virtual Vocal Tract

    Studies of speech sensorimotor learning often manipulate auditory feedback by modifying isolated acoustic parameters, such as formant frequency or fundamental frequency, using near real-time resynthesis of a participant's speech. An alternative approach is to engage a participant in a total remapping of the sensorimotor working space using a virtual vocal tract. To support this approach to studying speech sensorimotor learning, we have developed a system that controls an articulatory synthesizer using electromagnetic articulography data. Articulator movement data from the NDI Wave System are streamed to a Maeda articulatory synthesizer, and the resulting synthesized speech provides auditory feedback to the participant. This approach allows the experimenter to generate novel articulatory-acoustic mappings. Moreover, the acoustic output of the synthesizer can be perturbed using acoustic resynthesis methods. Since no robust speech-acoustic signal is required from the participant, this system allows for the study of sensorimotor learning in any individual, including those with severe speech disorders. In the current work, we present preliminary results demonstrating that typically functioning participants can use a virtual vocal tract to produce diphthongs within a novel articulatory-acoustic workspace. Once sufficient baseline performance is established, perturbations to auditory feedback (formant shifting) can elicit compensatory and adaptive articulatory responses.
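
    A system of this kind amounts to a streaming loop from articulograph samples to synthesizer parameters. The sketch below is hypothetical throughout: read_wave_frame, MaedaSynth and play are stand-ins for the NDI Wave interface, a Maeda synthesizer binding, and a low-latency audio output, none of which the source specifies.

        import numpy as np
        from lab_rig import read_wave_frame, play   # hypothetical I/O wrappers
        from artsynth import MaedaSynth             # hypothetical synthesizer binding

        # A linear remap from sensor coordinates (e.g. tongue/lip/jaw x-y in mm)
        # to the synthesizer's articulatory parameters; choosing A and b is how
        # the experimenter imposes a novel articulatory-acoustic mapping.
        rng = np.random.default_rng(0)
        A = rng.uniform(-0.1, 0.1, size=(7, 6))     # 7 Maeda params, 6 sensor dims
        b = np.zeros(7)

        synth = MaedaSynth(fs=16000)
        while True:
            sensors = read_wave_frame()             # one EMA sample vector, or None
            if sensors is None:
                break
            params = np.clip(A @ sensors + b, -3.0, 3.0)  # keep params in range
            play(synth.step(params))                # synthesize and play one buffer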

    Exploring auditory-motor interactions in normal and disordered speech

    Auditory feedback plays an important role in speech motor learning and in the online correction of speech movements. Speakers can detect and correct auditory feedback errors at the segmental and suprasegmental levels during ongoing speech. The frontal brain regions that contribute to these corrective movements have also been shown to be more active during speech in persons who stutter (PWS) than in fluent speakers. Further, various types of altered auditory feedback can temporarily improve the fluency of PWS, suggesting that atypical auditory-motor interactions during speech may contribute to stuttering disfluencies. To investigate this possibility, we have developed and improved Audapter, a software package that enables configurable dynamic perturbation of the spatial and temporal content of the auditory speech signal in real time. Using Audapter, we have measured the compensatory responses of PWS to static and dynamic perturbations of the formant content of auditory feedback and compared these responses with those from matched fluent controls. Our findings indicate deficient utilization of auditory feedback by PWS for short-latency online control of the spatial and temporal parameters of articulation during vowel production and during running speech. These findings provide further evidence that stuttering is associated with aberrant auditory-motor integration during speech.
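
    Audapter itself is a real-time C++/mex implementation; the offline sketch below only illustrates the underlying idea of a static formant perturbation via LP analysis-modification-resynthesis on a single windowed frame, using standard librosa/scipy calls rather than Audapter's API.

        import numpy as np
        import librosa
        from scipy.signal import lfilter

        def shift_formants(frame, order=12, ratio=1.2):
            """Scale the formant frequencies of one windowed frame by `ratio`:
            inverse-filter with an LP model, rotate the pole angles, and
            refilter the residual through the modified envelope."""
            a = librosa.lpc(frame.astype(float), order=order)  # [1, a1, ..., ap]
            residual = lfilter(a, [1.0], frame)                # LP residual
            poles = np.roots(a)
            moved = np.where(np.abs(poles.imag) > 1e-8,
                             np.abs(poles) * np.exp(1j * np.angle(poles) * ratio),
                             poles)                            # shift complex poles
            a_new = np.real(np.poly(moved))                    # back to coefficients
            return lfilter([1.0], a_new, residual)             # resynthesize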

    Extraction of vocal-tract system characteristics from speech signals

    We propose methods to track natural variations in the characteristics of the vocal-tract system from speech signals. We are especially interested in cases where these characteristics vary over time, as happens in dynamic sounds such as consonant-vowel transitions. We show that the selection of appropriate analysis segments is crucial in these methods, and we propose a selection based on estimated instants of significant excitation. These instants are obtained by a method based on the average group-delay property of minimum-phase signals. In voiced speech, they correspond to the instants of glottal closure. The vocal-tract system is characterized by its formant parameters, which are extracted from the analysis segments. Because the segments are always at the same relative position in each pitch period, in voiced speech the extracted formants are consistent across successive pitch periods. We demonstrate the results of the analysis for several difficult cases of speech signals.
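
    The average group-delay property mentioned above can be computed from two FFTs per window: if X is the spectrum of a windowed segment x(n) and X_n the spectrum of n*x(n), the group delay is Re{X_n X* / |X|^2}, and its average over frequency (the phase-slope function) crosses zero near instants of significant excitation. A rough sketch, with the window length and zero-crossing polarity as assumptions:

        import numpy as np

        def phase_slope_function(x, fs, win_ms=4.0):
            """Frame-wise average group delay of x (hop of one sample)."""
            n = int(win_ms * 1e-3 * fs)
            w = np.hanning(n)
            psf = np.empty(len(x) - n)
            for start in range(len(x) - n):
                seg = x[start:start + n] * w
                X = np.fft.rfft(seg)
                Xn = np.fft.rfft(np.arange(n) * seg)             # spectrum of n*x(n)
                tau = np.real(Xn * np.conj(X)) / (np.abs(X)**2 + 1e-12)
                psf[start] = tau.mean()                          # average group delay
            return psf

        def excitation_instants(x, fs):
            psf = phase_slope_function(x, fs)
            # positive-going zero crossings; the polarity convention varies
            return np.where((psf[:-1] < 0) & (psf[1:] >= 0))[0]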

    A low-delay 8 Kb/s backward-adaptive CELP coder

    Code-excited linear prediction (CELP) coding is an efficient technique for compressing speech. Communications-quality speech can be obtained at bit rates below 8 Kb/s. However, relatively large coding delays are necessary to buffer the input speech in order to perform the LPC analysis. A low-delay 8 Kb/s CELP coder is introduced in which the short-term predictor is based on past synthesized speech. A new distortion measure that improves the tracking of the formant filter is discussed. Formal listening tests showed that the performance of the backward-adaptive coder is almost as good as that of a conventional CELP coder.
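
    The key point of backward adaptation is that the short-term predictor is derived from previously synthesized samples, which the decoder also has, so no LPC side information or lookahead buffering is needed. A toy sketch of just that adaptation loop (the CELP codebook search is elided, and librosa.lpc stands in for whatever LP analysis the coder actually uses):

        import numpy as np
        import librosa

        def backward_lpc_frames(frames, order=10, history_len=240):
            """Compute per-frame LP coefficients from past *synthesized*
            speech only; synthesis is faked here as a passthrough."""
            history = 1e-4 * np.random.randn(history_len)  # non-zero warm-up
            lp_per_frame = []
            for frame in frames:
                a = librosa.lpc(history, order=order)   # LP from past output only
                lp_per_frame.append(a)
                # ... codebook search with synthesis filter 1/A(z) would go here;
                # the sketch pretends the synthesized frame equals the input.
                synth = frame
                history = np.concatenate([history, synth])[-history_len:]
            return lp_per_frame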

    Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods

    In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis are used. In the proposed refinement approach, the contours of the three lowest formants are first predicted by the data-driven DeepFormants tracker, and the predicted formants are then replaced frame-wise with local spectral peaks given by the model-driven LP-based methods. The refinement procedure can be plugged into the DeepFormants tracker without any new data learning. Two refined DeepFormants trackers were compared with the original DeepFormants and with five known traditional trackers using the popular vocal tract resonance (VTR) corpus. The results indicated that the data-driven DeepFormants trackers outperformed the conventional trackers and that the best performance was obtained by refining the formants predicted by DeepFormants using QCP-FB analysis. In addition, by tracking formants in VTR speech corrupted by additive noise, the study showed that the refined DeepFormants trackers were more resilient to noise than the reference trackers. In general, these results suggest that LP-based model-driven approaches, which have traditionally been used in formant estimation, can easily be combined with a modern data-driven tracker, with no further training, to improve the tracker's performance.
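
    The refinement step itself is simple to express: per frame, snap each DeepFormants prediction to the nearest pole frequency of an LP model (QCP-FB or LP-COV in the paper). A sketch under the assumption that per-frame LP polynomials are already available:

        import numpy as np

        def refine_with_lp_peaks(pred_formants, lp_polys, fs):
            """pred_formants: (frames, 3) array of tracked formants in Hz;
            lp_polys: per-frame LP coefficient arrays [1, a1, ..., ap]."""
            refined = np.array(pred_formants, dtype=float)
            for t, a in enumerate(lp_polys):
                roots = np.roots(a)
                roots = roots[roots.imag > 0]                # one per conjugate pair
                cand = np.angle(roots) * fs / (2 * np.pi)    # pole frequencies (Hz)
                cand = cand[(cand > 50) & (cand < fs / 2 - 50)]
                if cand.size == 0:
                    continue                                 # keep prediction as-is
                for k in range(refined.shape[1]):
                    j = np.argmin(np.abs(cand - refined[t, k]))
                    refined[t, k] = cand[j]                  # snap to LP peak
            return refined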

    Feedforward and feedback control in apraxia of speech: effects of noise masking on vowel production

    PURPOSE: This study was designed to test two hypotheses about apraxia of speech (AOS) derived from the Directions Into Velocities of Articulators (DIVA) model (Guenther et al., 2006): the feedforward system deficit hypothesis and the feedback system deficit hypothesis.

    METHOD: The authors used noise masking to minimize auditory feedback during speech. Six speakers with AOS and aphasia, four with aphasia but without AOS, and two groups of speakers without impairment (younger and older adults) participated. Acoustic measures of vowel contrast, variability, and duration were analyzed.

    RESULTS: Younger, but not older, speakers without impairment showed significantly reduced vowel contrast under noise masking. Relative to older controls, the AOS group showed longer vowel durations overall (regardless of masking condition) and a greater reduction in vowel contrast under masking. There were no significant differences in variability. Three of the six speakers with AOS demonstrated the group pattern. Speakers with aphasia but without AOS did not differ from controls in contrast, duration, or variability.

    CONCLUSION: The greater reduction in vowel contrast with masking noise in the AOS group is consistent with the feedforward system deficit hypothesis but not with the feedback system deficit hypothesis; however, effects were small and not present in all individual speakers with AOS. Theoretical implications and alternative interpretations of these findings are discussed.
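
    "Vowel contrast" here is an acoustic distance in formant space; one common operationalization (an assumption, since the abstract does not give the exact metric) is the mean pairwise distance between vowel-category centroids in the F1-F2 plane:

        import numpy as np

        def vowel_contrast(tokens):
            """tokens: dict mapping vowel label -> array of (F1, F2) pairs in Hz.
            Returns the mean pairwise Euclidean distance between centroids."""
            cents = [np.mean(np.asarray(p), axis=0) for p in tokens.values()]
            d = [np.linalg.norm(cents[i] - cents[j])
                 for i in range(len(cents)) for j in range(i + 1, len(cents))]
            return float(np.mean(d))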
