
    Exploring the contribution of voice quality to the perception of gender in Scottish English

    This study investigates how voice quality, here phonation, affects listener perception of speaker gender, and how voice quality interacts with pitch, a major cue to speaker gender, when cueing gender perceptions. Gender differences in voice quality have been identified in both Scottish (Beck and Schaeffler 2015; Stuart-Smith 1999) and American English (Abdelli-Beruh et al. 2014; D. Klatt and L. Klatt 1990; Podesva 2013; Syrdal 1996; Wolk et al. 2012; Yuasa 2010). Evidence from previous research suggests that gender differences in voice quality may also influence listener perception of speaker gender, with breathy voice being perceived as a feminine or female characteristic (Addington 1968; Andrews and Schmidt 1997; Bishop and Keating 2012; Holmberg et al. 2010; Porter 2012; Skuk and Schweinberger 2014; Van Borsel et al. 2009) and creaky voice as a masculine characteristic (Greer 2015; Lee 2016). However, some studies have found that voice quality has little effect (Booz and Ferguson 2016; King et al. 2012; Owen and Hancock 2010). The present study investigates the contribution of voice quality while taking into account the various methods of producing voice quality differences in stimuli, cultural differences in the gendered meanings of voice quality, and different methods of quantifying ‘perceived gender’, all of which may contribute to the conflicting results of previous studies. To investigate the contribution of voice quality to perceptions of speaker gender, a perception experiment was carried out in which 32 Scottish listeners and 40 North American listeners heard stimuli with different voice qualities (modal, breathy, creaky) at different pitch levels (120 Hz, 165 Hz, 210 Hz), and were asked to make judgements about the gender of the speaker. Differences in voice quality were produced by a speaker with the ability to create voice quality distinctions, as well as created through copy synthesis from the speaker’s voice. Listeners were asked to indicate whether they thought the voice belonged to a man or a woman and to rate how masculine and feminine the voice sounded. Relative to modal voice, I predicted that listeners would be more likely to categorise breathy voices as women and would rate them as more feminine and less masculine, and that listeners would be less likely to categorise creaky voices as women and would rate them as more masculine and less feminine. I also predicted that there might be differences in how Scottish and North American listeners perceived voice quality, given that gender differences in voice quality have been found to differ between these two varieties of English in previous research. Consistent with my predictions, I found that relative to modal voice, listeners were more likely to categorise breathy voice stimuli as women, and rated breathy voice stimuli as more feminine and less masculine. However, in contrast with my predictions, I found that relative to modal voice, listeners were also more likely to categorise creaky voice stimuli as women, and rated them as less masculine, but not more feminine. Furthermore, contrary to predictions, I did not identify differences between Scottish and North American listeners in voice quality perception. Differences were also found in how breathy and creaky voice influence gender perception at different pitch levels.
Overall, these results show that voice quality has an important influence on listener perception of speaker gender, and that the gendered meanings of creaky voice are changing and have dissociated from its low pitch. Future research should consider whether this evaluation among Scottish listeners reflects a wider change in gender differences in production.
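
    The abstract does not specify the statistical analysis, but a minimal sketch of one plausible approach to categorisation data like these is a logistic regression predicting "categorised as a woman" from voice quality and pitch. Everything below (trial counts, response probabilities, and the model formula) is fabricated for illustration only.

```python
import itertools

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for quality, pitch in itertools.product(["modal", "breathy", "creaky"],
                                        [120, 165, 210]):
    # Fabricated response probabilities: higher pitch and breathiness
    # push responses toward "woman" (illustration only, not the data).
    p = 0.10 + 0.004 * (pitch - 120) + (0.20 if quality == "breathy" else 0.0)
    rows += [{"quality": quality, "pitch_hz": pitch,
              "woman": int(rng.random() < p)} for _ in range(24)]
df = pd.DataFrame(rows)

# Odds of a "woman" response for breathy/creaky stimuli relative to modal,
# controlling for pitch.
model = smf.logit("woman ~ C(quality, Treatment('modal')) + pitch_hz",
                  data=df).fit(disp=False)
print(model.params)
```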

    On the use of voice descriptors for glottal source shape parameter estimation

    This paper summarizes the results of our investigations into estimating the shape of the glottal excitation source from speech signals. We employ the Liljencrants-Fant (LF) model describing the glottal flow and its derivative. The one-dimensional glottal source shape parameter Rd describes the transition in voice quality from a tense to a breathy voice. The parameter Rd has been derived from a statistical regression of the R waveshape parameters which parameterize the LF model. First, we introduce a variant of our recently proposed adaptation and range extension of the Rd parameter regression. Secondly, we discuss in detail the aspects of estimating the glottal source shape parameter Rd using the phase minimization paradigm. Based on the analysis of a large number of speech signals, we describe the major conditions that are likely to result in erroneous Rd estimates. Based on these findings, we investigate means to increase the robustness of the Rd parameter estimation. We use Viterbi smoothing to suppress unnatural jumps of the estimated Rd parameter contours within short time segments. Additionally, we propose to steer the Viterbi algorithm by exploiting the covariation of other voice descriptors to improve Viterbi smoothing. The novel Viterbi steering is based on a Gaussian Mixture Model (GMM) that represents the joint density of the voice descriptors and the Open Quotient (OQ) estimated from corresponding electroglottographic (EGG) signals. A conversion function derived from the mixture model predicts OQ from the voice descriptors. Converted to Rd, it defines an additional prior probability that adapts the partial probabilities of the Viterbi algorithm accordingly. Finally, we evaluate the performance of the phase minimization based methods, using both variants to adapt and extend the Rd regression, on a synthetic test set, as well as in combination with Viterbi smoothing and each variant of the novel Viterbi steering on a test set of natural speech. The experimental findings show improvements for both Viterbi approaches.
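
    As a rough illustration of the smoothing step described above, the sketch below runs a minimum-cost Viterbi pass over per-frame Rd candidates, with a quadratic transition cost penalising Rd jumps and a prior term standing in for the GMM-predicted OQ steering. The candidate grids, costs, and weights are invented assumptions, not the paper's settings.

```python
import numpy as np

def viterbi_smooth_rd(cand_rd, cand_cost, prior_rd, jump_w=4.0, prior_w=1.0):
    """cand_rd, cand_cost: (T, K) candidate Rd values and local costs per frame;
    prior_rd: (T,) Rd values predicted from the other voice descriptors."""
    T, K = cand_rd.shape
    # Local cost plus squared deviation from the descriptor-based prior.
    local = cand_cost + prior_w * (cand_rd - prior_rd[:, None]) ** 2
    acc = local[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # Transition cost penalises unnatural Rd jumps between frames.
        trans = jump_w * (cand_rd[t][None, :] - cand_rd[t - 1][:, None]) ** 2
        tot = acc[:, None] + trans              # (K prev, K current)
        back[t] = np.argmin(tot, axis=0)
        acc = local[t] + np.min(tot, axis=0)
    # Backtrack the minimum-cost Rd contour.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmin(acc))
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return cand_rd[np.arange(T), path]

rng = np.random.default_rng(1)
cand = rng.uniform(0.3, 2.7, size=(50, 8))    # per-frame Rd candidates
cost = rng.random((50, 8))                    # per-candidate estimation error
prior = np.full(50, 1.0)                      # descriptor-predicted Rd
print(viterbi_smooth_rd(cand, cost, prior)[:5])
```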

    Comparison of laryngoscopic, glottal and vibratory parameters among Estill qualities – Case study

    Estill Voice Training (EVT) is an effective educational system for developing and controlling the distinct voice qualities used in contemporary commercial singing. EVT teaches six vocal qualities that differ at 13 levels. This study aims to investigate whether the distinct vocal qualities taught by EVT can be systematically differentiated based on laryngoscopic observations and vocal fold oscillation parameters. To investigate the differences among the six EVT qualities, laryngeal dimensions and glottal area waveform parameters were measured in a single female subject who performed them over a one-octave scale. The Glottis Analysis Tools (GAT) software was used to measure these parameters, and phonovibrograms were obtained from the analysis. The resulting data were subjected to factor analysis to identify systematic differences between EVT qualities. High-speed videolaryngoscopy analysis revealed a significant influence of vocal quality on vocal fold oscillations. The factor analysis identified three factors based on the laryngeal dimensions and four factors derived from the GAT parameters. The first GAT factor was influenced by posterior adduction and distinguished belt quality from the other qualities, suggesting a significant influence of the aryepiglottic sphincter. The second GAT factor contained parameters derived from glottal length and amplitude, suggesting a relationship not only with vocal registers but also with laryngeal height. The third GAT factor was best related to the body-cover figure and phonation type (membranous medialization), while the fourth GAT factor was related to the amplitude-length ratio. These findings suggest that vocal fold oscillations can be used to distinguish between Estill voice qualities.
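
    A compact sketch of the factor-analysis step, assuming the measured parameters arrive as a recordings-by-parameters matrix. The data, the choice of four factors, and the varimax rotation are placeholders; the study's actual GAT outputs and rotation method are not given in the abstract.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Placeholder data: 120 analysed sequences x 10 glottal/laryngeal parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
X_std = StandardScaler().fit_transform(X)     # standardise before factoring

fa = FactorAnalysis(n_components=4, rotation="varimax", random_state=0)
scores = fa.fit_transform(X_std)              # factor scores per sequence
print(fa.components_.shape)                   # (4, 10): parameter loadings per factor
```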

    Analysis of human voice production using inverse filtering, high-speed imaging, and electroglottography

    Human voice production was studied using three methods: inverse filtering, digital high-speed imaging of the vocal folds, and electroglottography. The primary goal was to evaluate an inverse filtering method by comparing inverse filtered glottal flow estimates with information obtained by the other methods. More detailed examination of human voice source behavior was also included in the work. Material from two experiments was analyzed in this study. The data of the first experiment consisted of simultaneous recordings of the acoustic speech signal, electroglottogram, and high-speed imaging acquired during sustained vowel phonations. Inverse filtered glottal flow estimates were compared with glottal area waveforms derived from the image material by calculating pulse shape parameters from the signals. The material of the second experiment included recordings of the acoustic speech signal and electroglottogram during phonations of sustained vowels; this material was used for the analysis of the opening and closing phases of vocal fold vibration. The evaluated inverse filtering method was found to produce mostly reasonable estimates of glottal flow. However, the parameters of the system have to be set appropriately, which requires experience in inverse filtering and speech production. The flow estimates often showed a two-stage opening phase with two instants of rapid increase in the flow derivative. The instant of glottal opening detected in the electroglottogram was often found to coincide with an increase in the flow derivative. The instant of minimum flow derivative was found to occur mostly during the last quarter of the closing phase, and it was shown to precede the closing peak of the differentiated electroglottogram.
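
    As an illustration of the pulse-shape comparison used in the first experiment, the sketch below computes two common pulse shape parameters, an open quotient (OQ) and a speed quotient (SQ), from a single cycle of a glottal flow or area waveform. The threshold definition of "open" and the synthetic pulse are assumptions; the thesis's exact parameter set is not listed in the abstract.

```python
import numpy as np

def pulse_shape_params(pulse, open_thresh=0.05):
    """OQ = open time / cycle length; SQ = opening time / closing time,
    with 'open' defined by a small amplitude threshold below the peak."""
    pulse = np.asarray(pulse, dtype=float)
    peak = int(np.argmax(pulse))
    open_idx = np.flatnonzero(pulse > open_thresh * pulse[peak])
    t_open, t_close = open_idx[0], open_idx[-1]
    oq = (t_close - t_open + 1) / len(pulse)
    sq = (peak - t_open) / max(t_close - peak, 1)
    return oq, sq

# Hypothetical asymmetric pulse: slow opening phase, faster closing phase.
n = 200
t = np.linspace(0.0, 1.0, n)
pulse = np.clip(np.sin(np.pi * t ** 2), 0.0, None)
print(pulse_shape_params(pulse))
```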

    Glottal Parameter Estimation by Wavelet Transform for Voice Biometry

    Voice biometry is classically based mainly on the parameterization and patterning of speech features. The present approach is instead based on the characterization of phonation (glottal) features. The intention is to reduce intra-speaker variability due to the 'text'. The study of larynx biomechanics shows that the glottal correlates constitute a family of second-order Gaussian wavelets. The methodology relies on the extraction of glottal correlates (the glottal source), which are parameterized using wavelet techniques. Classification and pattern matching were carried out using Gaussian Mixture Models. Data from speakers in a balanced database and from NIST SRE HASR2 were used in verification experiments. Preliminary results are given and discussed.
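
    A schematic sketch of the pipeline named above: a continuous wavelet transform with a second-order Gaussian ("Mexican hat") wavelet parameterizes a glottal source estimate, and a Gaussian Mixture Model scores the resulting features, as in speaker verification. The signal, scales, and mixture size are illustrative assumptions.

```python
import numpy as np
import pywt
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
glottal_src = rng.normal(size=2048)          # stand-in for a glottal source estimate

# Continuous wavelet transform with the Mexican-hat (2nd-order Gaussian) wavelet.
scales = np.arange(1, 33)
coeffs, _ = pywt.cwt(glottal_src, scales, "mexh")
features = coeffs.T                          # one feature vector per sample

# Enrol a speaker model, then score a claimed identity (verification).
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(features)
print(gmm.score(features))                   # average log-likelihood
```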

    Voice source characterization for prosodic and spectral manipulation

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing, including speech synthesis, voice conversion and emotion detection. Thus, we study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase. In order to validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method gives satisfactory results in a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to the reference method, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking third, but still above the acceptance threshold (Fair-Good). Next, we proposed two methods for prosody modification, one for each of the residual representations described above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. The second method used resampling of the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed to achieve quality levels similar to the reference methods. As part of this dissertation, we have studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. The results showed that the evaluators preferred our method over the original one, rating it with a higher score on the MOS scale. To study voice quality, we recorded a small database consisting of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky and falsetto). Comparing the results with those reported in the literature, we found them to generally agree with previous findings. Some differences existed, but they could be attributed to the difficulties of comparing voice qualities produced by different speakers. At the same time, we conducted experiments in voice quality identification, with very good results. We also evaluated the performance of an automatic emotion classifier based on GMMs using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics.
The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. The detection accuracy for the different emotions was also high, improving on previously reported results using the same database. Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact in the field of automatic emotion classification.
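
    As a toy illustration of the residual-resampling idea behind the second prosody modification method, the sketch below stretches or compresses a single residual pitch period so its duration matches a target F0. The sampling rate, F0 values, and random "residual" are invented; the thesis's frame selection machinery is not reproduced.

```python
import numpy as np
from scipy.signal import resample

def retarget_period(residual_period, f0_tgt, fs=16000):
    """Resample one residual pitch period so its length matches the target F0."""
    return resample(residual_period, int(round(fs / f0_tgt)))

fs, f0_src, f0_tgt = 16000, 100.0, 150.0
period = np.random.default_rng(0).normal(size=int(fs / f0_src))  # fake residual
new_period = retarget_period(period, f0_tgt, fs)
print(len(period), "->", len(new_period))     # 160 -> 107 samples (100 -> 150 Hz)
```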

    Exposing the hidden vocal channel: Analysis of vocal expression

    This dissertation explored the perception and modeling of human vocal expression, and began by asking what people heard in expressive speech. To address this fundamental question, clips from Shakespearian soliloquy and from the Library of Congress Veterans Oral History Collection were presented to Mechanical Turk workers (10 per clip), who were asked to provide 1-3 keywords describing the vocal expression in the voice. The resulting keywords described prosody, voice quality, nonverbal quality, and emotion in the voice, along with the conversational style and personal qualities attributed to the speaker. More than half of the keywords described emotion, and these were wide-ranging and nuanced. In contrast, keywords describing prosody and voice quality reduced to a short list of frequently repeating vocal elements. Given this description of perceived vocal expression, a 3-step process was used to model the vocal qualities which listeners most frequently perceived. This process included (1) an interactive analysis across each condition to discover its distinguishing characteristics, (2) feature selection and evaluation via unequal variance sensitivity measurements and examination of means and 2-sigma variances across conditions, and (3) iterative, incremental classifier training and validation. The resulting models performed at 2-3.5 times chance. More importantly, the analysis revealed a continuum relationship across whispering, breathiness, modal speech, and resonance, and revealed multiple spectral sub-types of breathiness, modal speech, resonance, and creaky voice. Finally, latent semantic analysis (LSA) applied to the crowdsourced keyword descriptors enabled organic discovery of the expressive dimensions present in each corpus, and revealed relationships among perceived voice qualities and emotions within each dimension and across the corpora. The resulting dimensional classifiers performed at up to 3 times chance, and a second study presented a dimensional analysis of laughter. This research produced a new way of exploring emotion in the voice, and of examining relationships among emotion, prosody, voice quality, conversation quality, personal quality, and other expressive vocal elements. For future work, this perception-grounded fusion of crowdsourcing and LSA can be applied to anything humans can describe, in any research domain.
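
    A compact sketch of the LSA step described above, under the assumption that each clip's crowdsourced keywords are pooled into one document: TF-IDF weighting followed by truncated SVD exposes latent expressive dimensions. The keyword lists are invented examples, not the study's data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pooled keyword descriptors, one string per clip.
clip_keywords = [
    "angry tense harsh loud",
    "sad breathy soft slow",
    "calm warm resonant",
    "excited fast loud bright",
    "whispery breathy quiet",
    "creaky low tired",
]
tfidf = TfidfVectorizer().fit_transform(clip_keywords)
svd = TruncatedSVD(n_components=3, random_state=0)
dims = svd.fit_transform(tfidf)               # clip coordinates per latent dimension
print(dims.shape)                             # (6, 3)
```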

    Hidden Markov model based Finnish text-to-speech system utilizing glottal inverse filtering

    In this work, a new hidden Markov model (HMM) based text-to-speech (TTS) system utilizing glottal inverse filtering is described. The primary goal of the new TTS system is to enable the production of natural-sounding synthetic speech in different speaking styles, with different speaker characteristics and emotions. In order to achieve these goals, the function of the real human voice production mechanism is modeled with the help of glottal inverse filtering embedded in a statistical HMM framework. The new TTS system uses a glottal inverse filtering based parametrization method that enables the extraction of voice source characteristics separately from other speech parameters, and thus the individual modeling of these characteristics in the HMM system. In the synthesis stage, natural glottal flow pulses are used for creating the voice source, and the voice source characteristics are further modified according to the adaptive all-pole model generated by the HMM system in order to imitate the natural variation in the real voice source. Subjective listening tests show that the quality of the new TTS system is considerably better than that of a traditional HMM-based speech synthesizer. Moreover, the new system is clearly able to produce natural-sounding synthetic speech with specific speaker characteristics.
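
    As a toy illustration of the synthesis stage described above, the sketch below repeats a glottal-flow-like pulse at a target F0 and filters the pulse train through an all-pole vocal tract model. The pulse shape, F0, and filter coefficients are arbitrary stand-ins for the natural pulses and HMM-generated parameters used in the actual system.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 16000, 110.0, 0.5
period = int(fs / f0)
t = np.linspace(0.0, 1.0, period)
pulse = np.clip(np.sin(np.pi * t ** 2), 0.0, None)  # asymmetric glottal-like pulse

n_periods = int(dur * fs / period)
source = np.tile(pulse, n_periods)                  # voiced excitation at F0

# All-pole (LPC-style) vocal tract filter; coefficients chosen to be stable.
a = [1.0, -1.3, 0.8, -0.2]
speech = lfilter([1.0], a, source)
print(speech.shape)
```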

    Voice quality features in the production of pharyngeal consonants by Iraqi Arabic speakers

    This study investigates nasalisation and laryngealisation in the production of pharyngeal consonants in Iraqi Arabic (IA), and as potential voice quality (VQ) settings of IA speakers in general. Pharyngeal consonants have been the subject of investigation in many studies on Arabic, primarily due to the wide range of variation in their realisation across dialects, including approximant, fricative, and stop variants. This is the first quantitative study of its kind to extend these findings to IA and to investigate whether any of the variants and/or VQ features are dialect-specific. The study offers a detailed auditory and acoustic account of the realisations of pharyngeal consonants as produced by nine male speakers of three Iraqi dialects: Baghdad (representing Central gelet), Basra (representing Southern gelet) and Mosul (representing Northern qeltu) (Blanc, 1964; Ingham, 1997). Acoustic cues to nasalisation and phonation types are investigated in isolated vowels and in oral, nasal, and pharyngeal environments in order to unravel the source of the nasalised and laryngealised VQ percept and to establish whether their manifestations are categorical or particular to certain contexts. Results suggest a range of realisations for the pharyngeals that are conditioned by word position and dialect. Regardless of realisation, VQ measurements suggest that: (1) nasalisation increases when pharyngeals are adjacent to nasals, beyond what is expected of a nasal environment; (2) vowels neighbouring pharyngeals show more nasalisation than in oral environments; (3) vowels in pharyngeal contexts and in isolation show more laryngealisation than in nasal and oral contexts; (4) both nasals and pharyngeals show a progressive effect of nasalisation, and pharyngeals show a progressive effect of laryngealisation; (5) /ħ/ shows a stronger nasalisation but weaker laryngealisation effect on neighbouring vowels than /ʕ/; and (6) Baghdad speech is the most nasalised and laryngealised and Basra speech the least. These results coincide with observations on Muslim Baghdadi gelet having a guttural quality (Bellem, 2007). The study reveals that the overall percept of a nasalised and laryngealised VQ in IA is a local feature rather than a general vocal setting.
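
    The abstract does not list the acoustic measures used, but a standard cue for laryngealisation in studies of this kind is H1-H2, the amplitude difference between the first two harmonics, which tends to be low in creaky (laryngealised) vowels. The sketch below computes it on a synthetic frame; the vowel, F0, and harmonic search bands are illustrative assumptions, not the thesis's data.

```python
import numpy as np

def h1_h2(frame, f0, fs):
    """Amplitude difference (dB) between the first two harmonics."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def harm_amp(k):
        # Peak magnitude in a +/-20% band around the k-th harmonic.
        band = (freqs > k * f0 * 0.8) & (freqs < k * f0 * 1.2)
        return 20.0 * np.log10(spec[band].max())

    return harm_amp(1) - harm_amp(2)

fs, f0 = 16000, 120.0
t = np.arange(int(0.04 * fs)) / fs
vowel = np.sin(2 * np.pi * f0 * t) + 0.4 * np.sin(2 * np.pi * 2 * f0 * t)
print(round(h1_h2(vowel, f0, fs), 1))         # ~8 dB for this synthetic frame
```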