
    A Hybrid Parameterization Technique for Speaker Identification

    Classical parameterization techniques for Speaker Identification encode the power spectral density of raw speech, without discriminating articulatory features produced by vocal tract dynamics (acoustic-phonetics) from glottal source biometry. In the present paper a study is conducted to separate voiced fragments of speech into vocal and glottal components, dominated respectively by the vocal tract transfer function, estimated adaptively to track the acoustic-phonetic sequence of the message, and by the glottal characteristics of the speaker and the phonation gesture. The separation methodology is based on Joint Process Estimation under the hypothesis that the vocal and glottal spectral distributions are uncorrelated. Its application to voiced speech is presented in the time and frequency domains. The parameterization methodology is also described. Speaker Identification experiments conducted on 245 speakers are shown, comparing different parameterization strategies. The results confirm the better performance of decoupled parameterization compared with approaches based on plain speech parameterization.
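The paper's Joint Process Estimation machinery is not reproduced here, but the underlying idea of removing an adaptively estimated vocal-tract filter to leave a source-dominated residual can be sketched with ordinary LPC inverse filtering, a simpler, related technique. Everything below (the AR toy signal, the model order) is illustrative and not taken from the paper:

```python
import numpy as np

def lpc_coefficients(x, order):
    """Estimate LPC coefficients via the autocorrelation method,
    solving the Yule-Walker equations as a Toeplitz linear system."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])  # x[n] ~ sum a[k] x[n-1-k]

def inverse_filter(x, a):
    """Remove the estimated all-pole (vocal-tract-like) contribution,
    leaving a residual dominated by the excitation source."""
    fir = np.concatenate(([1.0], -a))  # A(z) = 1 - sum a[k] z^{-(k+1)}
    return np.convolve(x, fir, mode="full")[:len(x)]

# Toy example: an AR(2) "vocal tract" driven by white noise
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros(4000)
for n in range(2, 4000):
    x[n] = 1.3 * x[n - 1] - 0.7 * x[n - 2] + e[n]

a = lpc_coefficients(x, order=2)
residual = inverse_filter(x, a)
print(np.var(residual) < np.var(x))  # the residual is much "whiter"
```

On real speech the same operation would be applied frame by frame, and the residual would carry the glottal-source information that the paper parameterizes separately.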

    The relationships among physiological, acoustical, and perceptual measures of vocal effort

    The purpose of this work was to explore the physiological mechanisms of vocal effort, the acoustical manifestation of vocal effort, and the perceptual interpretation of vocal effort by speakers and listeners. The first study evaluated four proposed mechanisms of vocal effort specific to the larynx: intrinsic laryngeal tension, extrinsic laryngeal tension, supraglottal compression, and subglottal pressure. Twenty-six healthy adults produced modulations of vocal effort (mild, moderate, maximal) and rate (slow, typical, fast), followed by self-ratings of vocal effort on a visual analog scale. Ten physiological measures across the four hypothesized mechanisms were captured via high-speed flexible laryngoscopy, surface electromyography, and neck-surface accelerometry. A mixed-effects backward stepwise regression analysis revealed that estimated subglottal pressure, mediolateral supraglottal compression, and a normalized percent activation of extrinsic suprahyoid muscles significantly increased as ratings of vocal effort increased (R2 = .60). The second study had twenty inexperienced listeners rate vocal effort on the speech recordings from the first study (typical, mild, moderate, and maximal effort) via a visual sort-and-rate method. A set of acoustical measures were calculated, including amplitude-, time-, spectral-, and cepstral-based measures. Two separate mixed-effects regression models determined the relationship between the acoustical predictors and speaker and listener ratings. Results indicated that mean sound pressure level, low-to-high spectral ratio, and harmonic-to-noise ratio significantly predicted speaker and listener ratings. Mean fundamental frequency (measured as change in semitones from typical productions) and relative fundamental frequency offset cycle 10 were also significant predictors of listener ratings. The acoustical predictors accounted for 72% and 82% of the variance in speaker and listener ratings, respectively. 
Speaker and listener ratings were also highly correlated (average r = .86). From these two studies, we determined that vocal effort is a complex physiological process that is mediated by changes in laryngeal configuration and subglottal pressure. The self-perception of vocal effort is related to the acoustical properties underlying these physiological changes. Listeners appear to rely on the same acoustical manifestations as speakers, yet incorporate additional time-based acoustical cues during perceptual judgments. Future work should explore the physiological, acoustical, and perceptual measures identified here in speakers with voice disorders.
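Two of the significant acoustic predictors named above, mean sound pressure level and the low-to-high spectral ratio, are easy to sketch in numpy. The 4 kHz cutoff below is a common choice in the voice literature, not necessarily the one used in this study, and with an uncalibrated digital signal the SPL is only a relative level:

```python
import numpy as np

def mean_spl_db(x, ref=1.0):
    """Mean sound pressure level (dB) relative to `ref`; uncalibrated,
    so interpret as a relative rather than absolute level."""
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(rms / ref)

def low_high_ratio_db(x, sr, cutoff=4000.0):
    """Low-to-high spectral ratio (dB): spectral energy below vs. above
    `cutoff`. The 4 kHz cutoff is an assumption, not the study's value."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    low = spec[freqs < cutoff].sum()
    high = spec[freqs >= cutoff].sum()
    return 10.0 * np.log10(low / high)

# One second of a 200 Hz tone with faint broadband noise at 16 kHz
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 200 * t) + 0.01 * rng.standard_normal(sr)

spl = mean_spl_db(x)
lh = low_high_ratio_db(x, sr)
print(spl, lh > 0)  # low-frequency energy dominates, so the ratio is positive
```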

    A Voice Disease Detection Method Based on MFCCs and Shallow CNN

    The incidence of voice diseases is increasing year by year. The use of software for remote diagnosis is a technical development trend and has important practical value. Among voice diseases, common diseases that cause hoarseness include spasmodic dysphonia, vocal cord paralysis, vocal nodules, and vocal cord polyps. This paper presents a voice disease detection method that can be applied in a wide range of clinical settings. We cooperated with Xiangya Hospital of Central South University to collect voice samples from sixty-one different patients. Mel Frequency Cepstrum Coefficient (MFCC) parameters are extracted as input features to describe the voice in the form of data. An innovative model combining MFCC parameters with a single-convolution-layer CNN is proposed for fast calculation and classification. The highest accuracy achieved was 92%, clearly ahead of previously reported results. We also used the Advanced Voice Function Assessment Databases (AVFAD) to evaluate the generalization ability of the proposed method, which achieved an accuracy rate of 98%. Experiments on clinical and standard datasets show that, for the pathological detection of voice diseases, our method greatly improves accuracy and computational efficiency.
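A minimal sketch of the MFCC front end such a method relies on, in plain numpy; the single-convolution-layer classifier itself is omitted, and all parameter values (frame length, hop, filterbank size, 13 coefficients) are common defaults rather than the paper's settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(x, sr, n_fft=512, hop=160, n_mels=26, n_coef=13):
    """Minimal MFCC front end: framing -> Hamming window -> power
    spectrum -> triangular mel filterbank -> log -> DCT-II."""
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hamming(n_fft), axis=1)) ** 2

    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    logmel = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the filterbank energies; keep n_coef coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coef), (2 * n + 1) / (2.0 * n_mels)))
    return logmel @ dct.T

# One second of a 220 Hz tone at 16 kHz as a stand-in for a voice sample
sr = 16000
x = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
feats = mfcc(x, sr)
print(feats.shape)  # one 13-coefficient vector per 10 ms frame
```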

    Automated voice pathology discrimination from audio recordings benefits from phonetic analysis of continuous speech

    In this paper we evaluate the hypothesis that automated methods for diagnosis of voice disorders from speech recordings would benefit from contextual information found in continuous speech. Rather than basing a diagnosis on how disorders affect the average acoustic properties of the speech signal, the idea is to exploit the possibility that different disorders will cause different acoustic changes within different phonetic contexts. Any differences in the pattern of effects across contexts would then provide additional information for discrimination of pathologies. We evaluate this approach using two complementary studies: the first uses a short phrase which is automatically annotated using a phonetic transcription, the second uses a long reading passage which is automatically annotated from text. The first study uses a single sentence recorded from 597 speakers in the Saarbrücken Voice Database to discriminate structural from neurogenic disorders. The results show that discrimination performance for these broad pathology classes improves from 59% to 67% unweighted average recall when classifiers are trained for each phone label and the results fused. Although the phonetic contexts improved discrimination, the overall sensitivity and specificity of the method seem insufficient for clinical application. We hypothesise that this is because of the limited contexts in the speech audio and the heterogeneous nature of the disorders. In the second study we address these issues by processing recordings of a long reading passage obtained from clinical recordings of 60 speakers with either spasmodic dysphonia or vocal fold paralysis. We show that discrimination performance increases from 80% to 87% unweighted average recall if classifiers are trained for each phone-labelled region and predictions fused. We also show that the sensitivity and specificity of a diagnostic test with this performance is similar to other diagnostic procedures in clinical use.
In conclusion, the studies confirm that the exploitation of contextual differences in the way disorders affect speech improves automated diagnostic performance, and that automated methods for phonetic annotation of reading passages are robust enough to extract useful diagnostic information.
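The train-per-phone-then-fuse scheme can be illustrated abstractly. The abstract does not state the fusion rule, so the mean-posterior fusion below is one plausible choice; the phone labels and probability values are hypothetical:

```python
import numpy as np

def fuse_phone_predictions(phone_posteriors):
    """Fuse per-phone classifier outputs into one recording-level decision.
    `phone_posteriors` maps a phone label to that phone-specific
    classifier's class-posterior vector for the recording. Fusion here is
    the mean posterior; the paper does not specify this exact rule."""
    stacked = np.stack(list(phone_posteriors.values()))
    fused = stacked.mean(axis=0)
    return int(np.argmax(fused)), fused

# Hypothetical posteriors over [structural, neurogenic] from three
# phone-specific classifiers evaluated on one recording
posteriors = {
    "a": np.array([0.70, 0.30]),
    "s": np.array([0.40, 0.60]),
    "i": np.array([0.65, 0.35]),
}
label, fused = fuse_phone_predictions(posteriors)
print(label, fused)  # majority of phone contexts favour class 0
```

The appeal of this design is that a disorder which distorts, say, fricatives more than vowels contributes discriminative evidence through the phone contexts where its effect is strongest, instead of being averaged away.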

    Acoustic and videoendoscopic techniques to improve voice assessment via relative fundamental frequency

    Quantitative measures of laryngeal muscle tension are needed to improve assessment and track clinical progress. Although relative fundamental frequency (RFF) shows promise as an acoustic estimate of laryngeal muscle tension, it is not yet transferable to the clinic. The purpose of this work was to refine algorithmic estimation of RFF, as well as to enhance the knowledge surrounding the physiological underpinnings of RFF. The first study used a large database of voice samples collected from 227 speakers with voice disorders and 256 typical speakers to evaluate the effects of fundamental frequency estimation techniques and voice sample characteristics on algorithmic RFF estimation. By refining fundamental frequency estimation using the Auditory Sawtooth Waveform Inspired Pitch Estimator—Prime (Auditory-SWIPE′) algorithm and accounting for sample characteristics via the acoustic measure, pitch strength, algorithmic errors related to the accuracy and precision of RFF were reduced by 88.4% and 17.3%, respectively. The second study sought to characterize the physiological factors influencing acoustic outputs of RFF estimation. A group of 53 speakers with voice disorders and 69 typical speakers each produced the utterance, /ifi/, while simultaneous recordings were collected using a microphone and flexible nasendoscope. Acoustic features calculated via the microphone signal were examined in reference to the physiological initiation and termination of vocal fold vibration. The features that corresponded with these transitions were then implemented into the RFF algorithm, leading to significant improvements in the precision of the RFF algorithm to reflect the underlying physiological mechanisms for voicing offsets (p < .001, V = .60) and onsets (p < .001, V = .54) when compared to manual RFF estimation. 
The third study further elucidated the physiological underpinnings of RFF by examining the contribution of vocal fold abduction to RFF during intervocalic voicing offsets. Vocal fold abductory patterns were compared to RFF values in a subset of speakers from the second study, comprising young adults, older adults, and older adults with Parkinson’s disease. Abductory patterns were not significantly different among the three groups; however, vocal fold abduction was observed to play a significant role in measures of RFF at voicing offset. By improving algorithmic estimation and elucidating aspects of the underlying physiology affecting RFF, this work adds to the utility of RFF for use in conjunction with current clinical techniques to assess laryngeal muscle tension.

    Evaluating the translational potential of relative fundamental frequency

    Relative fundamental frequency (RFF) is an acoustic measure that quantifies short-term changes in fundamental frequency during voicing transitions surrounding a voiceless consonant. RFF is hypothesized to be decreased by increased laryngeal tension during voice production and has been considered a potential objective measure of vocal hyperfunction. Previous studies have supported claims that decreased RFF values may indicate the severity of vocal hyperfunction and have attempted to improve the methods to obtain RFF. In order to make progress towards developing RFF into a clinical measure, this dissertation aimed to investigate further the validity and reliability of RFF. Specifically, we examined the underlying physiological mechanisms, the auditory-perceptual relationship with strained voice quality, and test-retest reliability. The first study evaluated one of the previously hypothesized physiological mechanisms for RFF, vocal fold abduction. Vocal fold kinematics and RFF were obtained from both younger and older typical speakers producing RFF stimuli with voiceless fricatives and stops during high-speed videoendoscopy. We did not find any statistical differences between younger and older speakers, but we found that vocal folds were less adducted and RFF was lower at voicing onset after the voiceless stop compared to the fricative. This finding is in accordance with the hypothesized positive association between vocal fold contact area during voicing transitions and RFF. The second study examined the relationship between RFF and strain, a major auditory-perceptual feature of vocal hyperfunction. RFF values were synthetically modified by exchanging the RFF contours between voice samples that were produced with a comfortable voice and with maximum vocal effort, while other acoustic features remained constant. 
We observed that comfortable voice samples with the RFF values of maximum vocal effort samples had increased strain ratings, whereas maximum vocal effort samples with the RFF values of comfortable voice samples had decreased strain ratings. These findings support the contribution of RFF to perceived strain. The third study compared the test-retest reliability of RFF with that of conventional voice measures. We recorded individuals with healthy voices during five consecutive days and obtained acoustic, aerodynamic, and auditory-perceptual measures from the recordings. RFF was comparably reliable as acoustic and aerodynamic measures and more reliable than auditory-perceptual measures. This dissertation supports the translational potential of RFF by providing empirical evidence of the physiological mechanisms of RFF, the relationship between RFF and perceived strain, and test-retest reliability of RFF. Clinical applications of RFF are expected to improve objective diagnosis and assessment of vocal hyperfunction, and thus to lead to better voice care for individuals with vocal hyperfunction.
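Assuming the conventional RFF definition (the F0 of each of the ten transition cycles expressed in semitones relative to the steady-state cycle farthest from the voiceless consonant), the core computation is brief; the cycle-level F0 values below are hypothetical:

```python
import numpy as np

def rff_semitones(cycle_f0, ref_index):
    """Relative fundamental frequency: each vocal cycle's F0 expressed in
    semitones relative to a steady-state reference cycle."""
    f0 = np.asarray(cycle_f0, dtype=float)
    return 12.0 * np.log2(f0 / f0[ref_index])

# Hypothetical F0 (Hz) of the ten voicing-offset cycles before a voiceless
# consonant; cycle 1 (index 0) is farthest from the consonant and serves
# as the steady-state reference under the conventional definition.
offset_f0 = [200, 200, 199, 198, 197, 195, 193, 190, 186, 180]
rff = rff_semitones(offset_f0, ref_index=0)
print(rff[-1])  # offset cycle 10, the cycle nearest the consonant
```

The F0 decline toward the consonant yields negative offset-cycle values; under the hyperfunction hypothesis discussed above, greater laryngeal tension would shrink this excursion.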

    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The MAVEBA Workshop is held every two years; its proceedings collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and the classification of vocal pathologies. The Workshop has the sponsorship of: Ente Cassa Risparmio di Firenze, COST Action 2103, Biomedical Signal Processing and Control Journal (Elsevier Eds.), IEEE Biomedical Engineering Soc. Special issues of international journals have been, and will be, published, collecting selected papers from the conference.

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The MAVEBA Workshop is held every two years; its proceedings collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and the classification of vocal pathologies.

    Acoustic measurement of overall voice quality in sustained vowels and continuous speech

    Measurement of dysphonia severity involves auditory-perceptual evaluations and acoustic analyses of sound waves. A meta-analysis of proportional associations between these two methods showed that many popular perturbation metrics and noise-to-harmonics and other ratios do not yield valid results. However, this meta-analysis demonstrated that the validity of specific autocorrelation- and cepstrum-based measures was much more convincing, and identified ‘smoothed cepstral peak prominence’ as the most promising metric of dysphonia severity. Original research confirmed this inferiority of perturbation measures and superiority of cepstral indices in dysphonia measurement of laryngeal-vocal and tracheoesophageal voice samples. However, to be truly representative of daily voice use patterns, measurement of overall voice quality is ideally founded on the analysis of both sustained vowels and continuous speech. A customized method for including both sample types and calculating the multivariate Acoustic Voice Quality Index (AVQI) was constructed for this purpose. The original study of the AVQI revealed acceptable results in terms of initial concurrent validity, diagnostic precision, internal and external cross-validity, and responsiveness to change. It thus was concluded that the AVQI can track changes in dysphonia severity across the voice therapy process. There are many freely and commercially available computer programs and systems for acoustic metrics of dysphonia severity. We investigated agreements and differences between two commonly available programs (i.e., Praat and the Multi-Dimensional Voice Program) and systems. The results indicated that clinicians should not compare frequency perturbation data across systems and programs, nor amplitude perturbation data across systems. Finally, acoustic information can also be utilized as a biofeedback modality during voice exercises. 
Based on a systematic literature review, it was cautiously concluded that acoustic biofeedback can be a valuable tool in the treatment of phonatory disorders. When applied with caution, acoustic algorithms (particularly cepstrum-based measures and the AVQI) merit a special role in the assessment and/or treatment of dysphonia severity.
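A simplified, unsmoothed version of cepstral peak prominence can be sketched as follows; the published ‘smoothed’ variant adds time- and quefrency-domain smoothing, and the F0 search band and the regression fit over that band used here are assumptions, not taken from the text:

```python
import numpy as np

def cpp_db(x, sr, fmin=60.0, fmax=300.0):
    """Cepstral peak prominence: height of the dominant cepstral peak
    above a linear regression line fitted through the cepstrum over the
    F0 search band. A simplified, unsmoothed sketch."""
    log_spec = 20.0 * np.log10(np.abs(np.fft.fft(x)) + 1e-10)
    cep = np.fft.ifft(log_spec).real
    q1, q2 = int(sr / fmax), int(sr / fmin)   # quefrency search range
    q = np.arange(q1, q2)
    peak = q1 + int(np.argmax(cep[q1:q2]))
    slope, intercept = np.polyfit(q, cep[q1:q2], 1)
    return cep[peak] - (slope * peak + intercept), peak / sr

sr, n = 16000, 8000
rng = np.random.default_rng(2)
# Toy "voiced" signal: a 200 Hz glottal-like pulse train plus faint noise
pulses = np.zeros(n)
pulses[::80] = 5.0                      # one pulse per 5 ms -> 200 Hz
pulses += 0.01 * rng.standard_normal(n)
noise = rng.standard_normal(n)          # aperiodic comparison signal

# Search band narrowed around the expected F0, just for this toy example
cpp_p, q_p = cpp_db(pulses, sr, fmin=120.0)
cpp_n, _ = cpp_db(noise, sr, fmin=120.0)
print(cpp_p > cpp_n)  # the periodic signal has the more prominent peak
```

The peak quefrency `q_p` also recovers the pulse period (5 ms), which is why the same cepstral machinery underlies both dysphonia metrics and F0 estimation.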