Analysis and Detection of Pathological Voice using Glottal Source Features
Automatic detection of voice pathology enables objective assessment and earlier intervention in diagnosis. This study provides a systematic
analysis of glottal source features and investigates their effectiveness in
voice pathology detection. Glottal source features are extracted using glottal
flows estimated with the quasi-closed phase (QCP) glottal inverse filtering
method, using approximate glottal source signals computed with the zero
frequency filtering (ZFF) method, and using acoustic voice signals directly. In
addition, we propose to derive mel-frequency cepstral coefficients (MFCCs) from
the glottal source waveforms computed by QCP and ZFF to effectively capture the
variations in glottal source spectra of pathological voice. Experiments were
carried out using two databases, the Hospital Universitario Príncipe de Asturias (HUPA) database and the Saarbrücken Voice Disorders (SVD) database.
Analysis of features revealed that the glottal source contains information that
discriminates normal and pathological voice. Pathology detection experiments
were carried out using a support vector machine (SVM). From the detection experiments, it was observed that the performance achieved with the studied glottal source features is comparable to or better than that of conventional MFCCs
and perceptual linear prediction (PLP) features. The best detection performance
was achieved when the glottal source features were combined with the
conventional MFCCs and PLP features, which indicates the complementary nature
of the features.
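As a concrete illustration of the detection pipeline sketched in the abstract, the snippet below derives MFCCs from a glottal source waveform and trains an SVM detector. It is a minimal sketch, assuming the glottal flow has already been estimated by an inverse filtering method such as QCP or ZFF (not implemented here); librosa and scikit-learn are used for feature extraction and classification, and all function and variable names are illustrative rather than taken from the study.

```python
# Sketch: MFCCs computed from a glottal source waveform, fed to an SVM detector.
# Assumes `glottal_flows` holds waveforms already estimated by an inverse
# filtering method such as QCP or ZFF (not implemented here).
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def glottal_mfcc_features(glottal_flow, sr=16000, n_mfcc=13):
    """Frame-level MFCCs of the glottal waveform, summarized by mean and std."""
    mfcc = librosa.feature.mfcc(y=glottal_flow.astype(float), sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_detector(glottal_flows, labels, sr=16000):
    """Train a normal-vs-pathological SVM on glottal-source MFCC statistics."""
    X = np.vstack([glottal_mfcc_features(g, sr) for g in glottal_flows])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, labels)   # labels: 0 = normal, 1 = pathological
    return clf
```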
Describing Speech Production by Parameterizing the Glottal Excitation Estimated with Inverse Filtering
The excitation signal of a voiced sound, the glottal volume velocity waveform emerging from between the vibrating vocal folds, can be estimated using the so-called inverse filtering method. Analysis of speech production then typically consists of two stages: (a) computation of the glottal excitation by inverse filtering and (b) parameterization of the obtained flow pulse trains. The purpose of the latter stage is to express the essential information of the voice source in numerical form. This article reviews the methods that have been developed for parameterizing the glottal excitation. The methods are described by dividing them into time-domain and frequency-domain techniques, and for each parameter, information on its applications and typical values is compiled. The paper also discusses how these parameters have reflected the function of the voice source in various voice production studies, and concludes by comparing the usability of the best-known techniques in voice research. Keywords: speech production, inverse filtering, glottal excitation, parameterization
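As an illustration of the time-domain parameters such reviews cover, the widely used normalized amplitude quotient (NAQ) can be computed from a single glottal flow pulse and its derivative. The sketch below is a minimal example, assuming the input array spans exactly one fundamental period; it is not tied to any particular inverse filtering implementation.

```python
# Sketch: normalized amplitude quotient (NAQ) of one glottal flow pulse.
# NAQ = f_ac / (d_peak * T0), where f_ac is the peak-to-peak flow amplitude,
# d_peak the magnitude of the negative peak of the flow derivative, and T0
# the fundamental period (seconds). Assumes `pulse` spans exactly one period.
import numpy as np

def naq(pulse, fs):
    f_ac = pulse.max() - pulse.min()             # AC amplitude of the flow
    d_peak = -np.min(np.diff(pulse)) * fs        # negative peak of the derivative
    t0 = len(pulse) / fs                         # period length in seconds
    return f_ac / (d_peak * t0)
```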
Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods
In this study, formant tracking is investigated by refining the formants
tracked by an existing data-driven tracker, DeepFormants, using the formants
estimated in a model-driven manner by linear prediction (LP)-based methods. As
LP-based formant estimation methods, conventional covariance analysis (LP-COV)
and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis
are used. In the proposed refinement approach, the contours of the three lowest
formants are first predicted by the data-driven DeepFormants tracker, and the
predicted formants are replaced frame-wise with the local spectral peaks given by the model-driven LP-based methods. The refinement procedure can be plugged into the DeepFormants tracker without any additional training. Two refined
DeepFormants trackers were compared with the original DeepFormants and with
five known traditional trackers using the popular vocal tract resonance (VTR)
corpus. The results indicated that the data-driven DeepFormants trackers
outperformed the conventional trackers and that the best performance was
obtained by refining the formants predicted by DeepFormants using QCP-FB
analysis. In addition, by tracking formants using VTR speech that was corrupted
by additive noise, the study showed that the refined DeepFormants trackers were
more resilient to noise than the reference trackers. In general, these results
suggest that LP-based model-driven approaches, which have traditionally been
used in formant estimation, can be combined with a modern data-driven tracker
easily with no further training to improve the tracker's performance.
Comment: Computer Speech and Language, Vol. 81, Article 101515, June 2023
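The refinement step described above can be reduced to a frame-wise substitution. The sketch below assumes per-frame formant predictions from DeepFormants and per-frame LP spectral peak frequencies (e.g., from QCP-FB analysis) are already available, and that each predicted formant is replaced by the closest LP peak; that nearest-peak rule and all names are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: replace each DeepFormants-predicted formant with the closest
# spectral peak found by an LP-based analysis (e.g., QCP-FB) in that frame.
import numpy as np

def refine_formants(predicted, lp_peaks):
    """
    predicted : array (n_frames, 3), formant tracks from the data-driven tracker
    lp_peaks  : list of n_frames arrays, LP spectral-peak frequencies per frame
    returns   : refined formant tracks of the same shape
    """
    refined = predicted.copy()
    for t, peaks in enumerate(lp_peaks):
        if len(peaks) == 0:
            continue                      # keep the prediction if no peak found
        for k in range(predicted.shape[1]):
            j = np.argmin(np.abs(peaks - predicted[t, k]))
            refined[t, k] = peaks[j]      # snap to the nearest LP peak
    return refined
```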
Severity Classification of Parkinson's Disease from Speech using Single Frequency Filtering-based Features
Developing objective methods for assessing the severity of Parkinson's disease (PD) is crucial for improving diagnosis and treatment. This study
proposes two sets of novel features derived from the single frequency filtering
(SFF) method: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs from the SFF
(MFCC-SFF) for the severity classification of PD. Prior studies have
demonstrated that SFF offers greater spectro-temporal resolution than the short-time Fourier transform. The study uses the PC-GITA database, which
includes speech of PD patients and healthy controls produced in three speaking
tasks (vowels, sentences, text reading). Experiments using an SVM classifier
revealed that the proposed features outperformed the conventional MFCCs in all
three speaking tasks. The proposed SFFCC and MFCC-SFF features gave a relative
improvement of 5.8% and 2.3% for the vowel task, 7.0% and 1.8% for the sentence task, and 2.4% and 1.1% for the read text task, in comparison to MFCC features.
Comment: Accepted by INTERSPEECH 202
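For context, the single frequency filtering operation underlying both proposed feature sets can be sketched as below, assuming the common formulation in which each analysis frequency is shifted to fs/2 by complex modulation and filtered with a single real pole near the unit circle; this is an illustrative sketch, not the authors' implementation. Cepstral features such as SFFCC or MFCC-SFF would then follow by applying a (mel) filterbank, log compression, and a DCT to frame-averaged envelopes.

```python
# Sketch: single frequency filtering (SFF) amplitude envelopes.
# Each analysis frequency f_k is shifted to fs/2 by complex modulation and the
# modulated signal is passed through a single-pole filter with its pole near
# z = -1, so |y_k[n]| approximates the signal's envelope at f_k over time.
import numpy as np
from scipy.signal import lfilter

def sff_envelopes(x, fs, freqs, r=0.995):
    n = np.arange(len(x))
    envs = np.empty((len(freqs), len(x)))
    for i, fk in enumerate(freqs):
        f_shift = fs / 2.0 - fk                           # shift f_k to fs/2
        xk = x * np.exp(-1j * 2 * np.pi * f_shift * n / fs)
        yk = lfilter([1.0], [1.0, r], xk)                 # pole at z = -r
        envs[i] = np.abs(yk)
    return envs   # shape: (num_freqs, num_samples)
```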
Disentangling the effects of phonation and articulation: Hemispheric asymmetries in the auditory N1m response of the human brain
BACKGROUND: The cortical activity underlying the perception of vowel identity has typically been addressed by manipulating the first and second formant frequencies (F1 and F2) of the speech stimuli. These two values, originating from articulation, are already sufficient for the phonetic characterization of vowel category. In the present study, we investigated how the spectral cues caused by articulation are reflected in cortical speech processing when combined with phonation, the other major part of speech production, manifested as the fundamental frequency (F0) and its harmonic integer multiples. To study the combined effects of articulation and phonation, we presented vowels with either high (/a/) or low (/u/) formant frequencies, driven by three different types of excitation: a natural periodic pulseform reflecting the vibration of the vocal folds, an aperiodic noise excitation, or a tonal waveform. The auditory N1m response was recorded with whole-head magnetoencephalography (MEG) from ten human subjects in order to resolve whether brain events reflecting articulation and phonation are specific to the left or right hemisphere of the human brain. RESULTS: The N1m responses for the six stimulus types spanned a considerable latency range of 115–135 ms and were elicited faster (by ~10 ms) by the high-formant /a/ than by the low-formant /u/, indicating an effect of articulation. While excitation type had no effect on the latency of the right-hemispheric N1m, the left-hemispheric N1m elicited by the tonally excited /a/ was some 10 ms earlier than that elicited by the periodic and the aperiodic excitation. The amplitude of the N1m in both hemispheres was systematically larger in response to stimulation with natural periodic excitation. Also, stimulus type had a marked (up to 7 mm) effect on the source location of the N1m, with periodic excitation resulting in more anterior sources than aperiodic and tonal excitation. CONCLUSION: The auditory brain areas of the two hemispheres exhibit differential tuning to natural speech signals, observable already in the passive recording condition. The variations in the latency and strength of the auditory N1m response can be traced back to the spectral structure of the stimuli. More specifically, the combined effects of the harmonic comb structure, originating from the natural voice excitation caused by the fluctuating vocal folds, and the location of the formant frequencies, originating from the vocal tract, lead to asymmetric behaviour of the left and right hemispheres.
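The stimuli described above follow the classical source-filter scheme: an excitation signal (a periodic pulse train, noise, or a tone) is passed through resonators placed at the vowel's formant frequencies. The sketch below illustrates that construction only; the formant frequencies, bandwidths, and other parameter values are assumptions, not the study's stimulus specifications.

```python
# Sketch: source-filter construction of vowel-like stimuli with different
# excitation types (periodic pulses, noise, or a tone) and formant resonators.
# Formant frequencies/bandwidths below are illustrative, not the study's values.
import numpy as np
from scipy.signal import lfilter

def resonator(sig, f, bw, fs):
    """Second-order all-pole resonance at frequency f (Hz) with bandwidth bw."""
    r = np.exp(-np.pi * bw / fs)
    a = [1.0, -2.0 * r * np.cos(2 * np.pi * f / fs), r * r]
    return lfilter([1.0 - r], a, sig)   # rough gain normalization

def make_vowel(excitation, formants, fs=16000):
    out = excitation
    for f, bw in formants:
        out = resonator(out, f, bw, fs)
    return out / np.max(np.abs(out))

fs, dur, f0 = 16000, 0.3, 100
n = int(fs * dur)
pulses = np.zeros(n); pulses[::fs // f0] = 1.0          # periodic excitation
noise = np.random.randn(n)                               # aperiodic excitation
tone = np.sin(2 * np.pi * f0 * np.arange(n) / fs)        # tonal excitation
vowel_a = make_vowel(pulses, [(700, 90), (1100, 110)])   # /a/-like formants
vowel_u = make_vowel(noise, [(300, 80), (700, 100)])     # /u/-like formants
```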
Parameterization of a computational physical model for glottal flow using inverse filtering and high-speed videoendoscopy
High-speed videoendoscopy, glottal inverse filtering, and physical modeling can be used to obtain complementary information about speech production. In this study, the three methodologies are combined to pursue a better understanding of the relationship between the glottal air flow and glottal area. Simultaneously acquired high-speed video and glottal inverse filtering data from three male and three female speakers were used. Significant correlations were found between the quasi-open and quasi-speed quotients of the glottal area (extracted from the high-speed videos) and glottal flow (estimated using glottal inverse filtering), but only the quasi-open quotient relationship could be represented as a linear model. A simple physical glottal flow model with three different glottal geometries was optimized to match the data. The results indicate that glottal flow skewing can be modeled using an inertial vocal/subglottal tract load and that estimated inertia within the glottis is sensitive to the quality of the data. Parameter optimisation also appears to favour combining the simplest glottal geometry with viscous losses and the more complex glottal geometries with entrance/exit effects in the glottis.
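The quasi-open quotient mentioned above is computed in the same way from either signal, glottal area or estimated glottal flow: within one cycle, it is the fraction of the period during which the signal exceeds a threshold placed at 50% of its peak-to-peak amplitude above the minimum. A minimal per-pulse sketch follows; the 50% level is the common convention, used here as an assumption.

```python
# Sketch: quasi-open quotient (QOQ) of one glottal cycle, computed identically
# for a glottal area or glottal flow pulse: the fraction of the period during
# which the signal exceeds 50% of its peak-to-peak amplitude above the minimum.
import numpy as np

def quasi_open_quotient(pulse):
    threshold = pulse.min() + 0.5 * (pulse.max() - pulse.min())
    quasi_open_samples = np.sum(pulse > threshold)
    return quasi_open_samples / len(pulse)   # pulse spans exactly one period
```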
Analysis of phonation onsets in vowel production, using information from glottal area and flow estimate
A multichannel dataset comprising high-speed videoendoscopy images, electroglottography signals, and free-field microphone signals was used to investigate phonation onsets in vowel production. Use of the multichannel data enabled simultaneous analysis of the two main aspects of phonation: glottal area, extracted from the high-speed videoendoscopy images, and glottal flow, estimated from the microphone signal using glottal inverse filtering. Pulse-wise parameterization of the glottal area and glottal flow indicates that there is no single dominant way to initiate quasi-stable phonation. The trajectories of fundamental frequency and normalized amplitude quotient, extracted from glottal area and estimated flow, may differ markedly during onsets. The location and steepness of the amplitude envelopes of the two signals were observed to be closely related, and quantitative analysis supported the hypothesis that glottal area and flow do not carry essentially different amplitude information during vowel onsets. Linear models were used to predict the phonation onset times from the characteristics of the subsequent steady phonation. The phonation onset time of glottal area was found to be well predicted by a combination of the fundamental frequency and the normalized amplitude quotient of the glottal flow, together with the gender of the speaker. For the phonation onset time of glottal flow, the best linear model was obtained using the fundamental frequency and the normalized amplitude quotient of the glottal flow as predictors.
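The linear onset-time models referred to above amount to ordinary least-squares regression of onset time on descriptors of the subsequent steady phonation. The sketch below shows the form of such a model with scikit-learn; the predictor names and the placeholder values are purely illustrative, not data from the study.

```python
# Sketch: linear model predicting phonation onset time from steady-phonation
# descriptors (f0, NAQ of the estimated glottal flow, speaker gender).
# Arrays below are placeholders, not values from the study.
import numpy as np
from sklearn.linear_model import LinearRegression

f0 = np.array([110.0, 210.0, 125.0, 198.0])                 # Hz, steady phonation
naq_flow = np.array([0.12, 0.18, 0.10, 0.16])               # NAQ of estimated flow
gender = np.array([0, 1, 0, 1])                             # 0 = male, 1 = female
onset_time_area = np.array([0.045, 0.030, 0.050, 0.028])    # s, target variable

X = np.column_stack([f0, naq_flow, gender])
model = LinearRegression().fit(X, onset_time_area)
print(model.coef_, model.intercept_)
```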
Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals
In this paper, we propose a new method for the accurate estimation and
tracking of formants in speech signals using time-varying quasi-closed-phase
(TVQCP) analysis. Conventional formant tracking methods typically adopt a
two-stage estimate-and-track strategy wherein an initial set of formant
candidates are estimated using short-time analysis (e.g., 10–50 ms), followed
by a tracking stage based on dynamic programming or a linear state-space model.
One of the main disadvantages of these approaches is that the tracking stage,
however good it may be, cannot improve upon the formant estimation accuracy of
the first stage. The proposed TVQCP method provides a single-stage formant
tracking that combines the estimation and tracking stages into one. TVQCP
analysis combines three approaches to improve formant estimation and tracking:
(1) it uses temporally weighted quasi-closed-phase analysis to derive
closed-phase estimates of the vocal tract with reduced interference from the
excitation source, (2) it increases the sparsity of the prediction residual through its optimization criterion, and (3) it uses time-varying linear prediction analysis over long time windows (e.g., 100–200 ms) to impose a continuity constraint on the vocal
tract model and hence on the formant trajectories. Formant tracking experiments
with a wide variety of synthetic and natural speech signals show that the
proposed TVQCP method performs better than conventional and popular formant
tracking tools, such as Wavesurfer and Praat (based on dynamic programming),
the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on
deep neural networks trained in a supervised manner). Matlab scripts for the
proposed method can be found at: https://github.com/njaygowda/ftrac
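The temporally weighted analysis at the core of TVQCP can be illustrated with a plain weighted linear prediction solver: the prediction coefficients minimize a weighted sum of squared residuals, and formant candidates are then read off the angles of the resulting polynomial's roots. The sketch below leaves the weighting function to the caller (a QCP-style AME weight would attenuate samples around the main glottal excitation) and does not reproduce the time-varying, long-window formulation of the paper; it is a simplified, single-frame sketch.

```python
# Sketch: temporally weighted linear prediction. Coefficients minimize
#   sum_n w[n] * (x[n] - sum_k a_k * x[n-k])^2,
# the core idea behind quasi-closed-phase analysis; the weight w would be an
# AME-type function that attenuates samples around the main glottal excitation.
import numpy as np

def weighted_lp(x, order, w):
    N = len(x)
    X = np.column_stack([x[order - k:N - k] for k in range(1, order + 1)])
    y = x[order:]
    wn = w[order:]
    A = X.T @ (wn[:, None] * X)          # weighted normal equations
    b = X.T @ (wn * y)
    a = np.linalg.solve(A, b)
    return np.concatenate(([1.0], -a))   # polynomial 1 - sum_k a_k z^{-k}

def formants_from_lp(lp_poly, fs):
    """Formant frequency candidates (Hz) from the angles of the LP roots."""
    roots = np.roots(lp_poly)
    roots = roots[np.imag(roots) > 0]    # keep one of each conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))
```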