207 research outputs found

    Normal-to-Lombard Adaptation of Speech Synthesis Using Long Short-Term Memory Recurrent Neural Networks

    In this article, three adaptation methods are compared based on how well they change the speaking style of a neural-network-based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context. The evaluated vocoders are: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In objective evaluations and speaking style similarity tests, we found that the FT method outperformed the other two adaptation methods. In speech intelligibility tests, we found that there were no significant differences between vocoders, although the PML vocoder showed slightly better performance compared to the three other vocoders.

    Methods for speaking style conversion from normal speech to high vocal effort speech

    This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics on the conversion of normal speech to high vocal effort speech. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with a vocal effort mismatch between test and enrollment utterances (shouted speech vs. normal speech). The mismatch degrades the system's speaker identification performance. As a solution, we proposed an SSC system that included a novel spectral mapping, used along with a statistical mapping technique, to transform the mel-frequency spectral energies of normal speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing speaker identification rates for a state-of-the-art i-vector-based speaker recognition system, with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves the speaker identification rates. The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. This system first extracts speech features using the vocoder. Next, a mapping technique that is robust to data scarcity maps the features. Finally, the vocoder synthesizes the mapped features into speech. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoders with two subjective listening tests that measured similarity to Lombard speech and naturalness. The similarity test showed that, with both vocoders, our proposed SSC system was able to convert normal speech to Lombard speech. The naturalness test showed that the samples converted using the glottal vocoder were clearly more natural than those obtained with STRAIGHT.

    Voice Disorders Secondary to Thyroidectomy: A Case Study

    The thyroid is an important gland that aids in development. Located anteriorly at the base of the neck, the thyroid produces hormones that regulate metabolism. Thyroid dysfunction can lead to excess or reduced production of hormones, known as hyper- and hypothyroidism. Usually affecting women, hyper- and hypothyroidism can be life-threatening. A well-known treatment is a thyroidectomy, or removal of the thyroid gland. Many people report vocal change secondary to thyroidectomy. Dysfunction can result from intubation during surgery or damage to laryngeal nerves and/or muscles. A participant’s low intensity and difficulty with projection prompted a case study to examine the laryngeal area for differences. KayPENTAX Visi-Pitch and Videostrobe instrumentation were utilized to provide instant video feedback and acoustic parameters that were compared to typical parameters/structures. Using videostroboscopy, a rigid scope containing a miniature camera was placed over the base of the tongue. The participant phonated syllables such as “ee” and “ah” from low to high pitch ranges at Cleveland State University (CSU). A voice sample was also analyzed through Visi-Pitch instrumentation to assess parameters including jitter, shimmer, and fundamental frequency, among others. Structural and acoustic parameters from Cleveland State were compared to results from an Ear, Nose, and Throat (ENT) doctor also utilizing videostroboscopy. Both CSU and ENT results note structural and acoustic differences despite no reported laryngeal nerve damage post-thyroidectomy.

    Speaker sex effects on temporal and spectro-temporal measures of speech

    This study investigated speaker sex differences in the temporal and spectro-temporal parameters of English monosyllabic words spoken by thirteen women and eleven men. Vowel and utterance duration were investigated. A number of formant frequency parameters were also analysed to assess the spectro-temporal dynamic structures of the monosyllabic words as a function of speaker sex. Absolute frequency changes were measured for the first (F1), second (F2), and third (F3) formant frequencies (ΔF1, ΔF2, and ΔF3, respectively). Rates of these absolute formant frequency changes were also measured and calculated to yield measurements for rF1, rF2, and rF3. Normalised frequency changes (normΔF1, normΔF2, and normΔF3), and normalised rates of change (normrF1, normrF2, and normrF3) were also calculated. F2 locus equations were then derived from the F2 measurements taken at the onset and temporal mid points of the vowels. Results indicated that there were significant sex differences in the spectro-temporal parameters associated with F2: ΔF2, normΔF2, rF2, and F2 locus equation slopes; women displayed significantly higher values for ΔF2, normΔF2 and rF2, and significantly shallower F2 locus equation slopes. Collectively, these results suggested lower levels of coarticulation in the speech samples of the women speakers, and corroborate evidence reported in earlier studies.

    The impact of a standardized vocal loading test on vocal fold oscillations

    Introduction: Vocal loading capacity is an important aspect of vocal health and is measured using standardized vocal loading tests. However, it remains unclear how vocal fold oscillation patterns are influenced by a standardized vocal loading task. Methods: Twenty-one vocally healthy subjects (10 male, 11 female) were assessed using the dysphonia severity index (DSI) and high-speed videolaryngoscopy (HSV) on the vowel /i/ at a comfortable pitch and loudness, before and after a standardized vocal loading test (10 min of standardized text reading at a level higher than 80 dB(A), measured at 30 cm from the mouth). Results: The DSI decreased by a statistically significant 1.2 points after the vocal loading test, mainly because of an increase in the minimum intensity. However, the pre-post comparison of HSV-derived measures failed to show any statistically significant changes. Conclusion: It seems necessary to analyze the effects of a standardized vocal loading test on vocal fold oscillation patterns with respect to softest phonation and phonation threshold pressure rather than comfortable pitch and loudness.

    Reliability of laryngo-stroboscopic evaluation based on visual perceptual judgment

    This study investigated the inter-rater and intra-rater reliability of visual perceptual evaluation of laryngo-stroboscopic images. Two hundred and fifty-five laryngo-stroboscopic video samples were collected from 75 subjects. Three raters evaluated the images on 4 measurements: 1) mass lesion size, 2) amplitude of vocal fold vibration, 3) supraglottic activity, and 4) shape of the glottal closure, using the modified Stroboscopy Examination Rating Form (Poburka, 1999). Results showed that substantial inter-rater and intra-rater reliability was achieved for lesion size, antero-posterior supraglottic activity and glottal closure. However, evaluation of medio-lateral supraglottic activity and the amplitude of vocal fold vibration did not achieve adequate reliability (values ranged from 0.45 to 0.50). The findings indicated that laryngo-stroboscopic examination is a relatively reliable method for the measurement of lesion size, antero-posterior supraglottic compression and glottal closure. This finding is better than those reported in the literature (Nawka & Konerding, 2012). Meanwhile, the vocal fold vibratory amplitude measure was found to be the least reliable.