Exploring the contribution of voice quality to the perception of gender in Scottish English
This study investigates how voice quality, here phonation, affects listener perception of speaker gender, and how voice quality interacts with pitch, a major cue to speaker gender, when cueing gender perceptions. Gender differences in voice quality have been identified in both Scottish (Beck and Schaeffler 2015; Stuart-Smith 1999) and American English (Abdelli-Beruh et al. 2014; Klatt and Klatt 1990; Podesva 2013; Syrdal 1996; Wolk et al. 2012; Yuasa 2010). Evidence from previous research suggests that gender differences in voice quality may also influence listener perception of speaker gender, with breathy voice being perceived as a feminine or female characteristic by listeners (Addington 1968; Andrews and Schmidt 1997; Bishop and Keating 2012; Holmberg et al. 2010; Porter 2012; Skuk and Schweinberger 2014; Van Borsel et al. 2009) and creaky voice as a masculine characteristic (Greer 2015; Lee 2016). However, some studies have found that voice quality has little effect (Booz and Ferguson 2016; King et al. 2012; Owen and Hancock 2010). The present study investigates the contribution of voice quality while taking into account the various methods of producing voice quality differences in stimuli, cultural differences in the gendered meanings of voice quality, and the different methods of quantifying ‘perceived gender’, all of which may contribute to the conflicting results of previous studies.
To investigate the contribution of voice quality to perceptions of speaker gender, a perception experiment was carried out in which 32 Scottish listeners and 40 North American listeners heard stimuli with different voice qualities (modal, breathy, creaky) at different pitch levels (120 Hz, 165 Hz, 210 Hz), and were asked to make judgements about the gender of the speaker. Differences in voice quality were produced by a speaker able to create voice quality distinctions, as well as through copy synthesis from the speaker’s voice. Listeners were asked to indicate whether they thought the voice belonged to a man or a woman and to rate how masculine and feminine the voice sounded. Relative to modal voice, I predicted that listeners would be more likely to categorise breathy voices as women and would rate them as more feminine and less masculine, and that listeners would be less likely to categorise creaky voices as women and would rate them as more masculine and less feminine. I also predicted that there might be differences in how Scottish and North American listeners perceived voice quality, given that gender differences in voice quality in these two varieties of English have been found to differ in previous research.
Consistent with my predictions, I found that relative to modal voice, listeners were more likely to categorise breathy voice stimuli as women, and rated breathy voice stimuli as more feminine and less masculine. However, in contrast with my predictions, I found that relative to modal voice, listeners were more likely to categorise creaky voice stimuli as women, and rated them as less masculine, but not more feminine. Furthermore, contrary to predictions, I did not identify differences between Scottish and North American listeners in terms of voice quality perception. Differences were also found in how breathy and creaky voice influence gender perception at different pitch levels.
Overall, these results show that voice quality has an important influence on listener perception of speaker gender, and that the gendered meanings of creaky voice are changing and have dissociated from its low pitch. Future research should consider whether this evaluation among Scottish listeners may reflect a wider change in the gender differences in production.
A novel framework for high-quality voice source analysis and synthesis
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The analysis, parameterization and modeling of voice source estimates obtained via inverse filtering of recorded speech are among the most challenging areas of speech processing, owing to the fact that humans produce a wide range of voice source realizations and that voice source estimates commonly contain artifacts due to non-linear, time-varying source-filter coupling. Currently, the most widely adopted representation of the voice source signal is the Liljencrants-Fant (LF) model, developed in late 1985. Owing to its overly simplistic interpretation of voice source dynamics, the LF model can represent neither the fine temporal structure of glottal flow derivative realizations nor the spectral richness needed for truly natural-sounding speech synthesis. In this thesis we introduce Characteristic Glottal Pulse Waveform Parameterization and Modeling (CGPWPM), an entirely novel framework for voice source analysis, parameterization and reconstruction. In a comparative evaluation of CGPWPM and the LF model, we demonstrate that the proposed method preserves higher levels of speaker-dependent information from the voice source estimates and achieves more natural-sounding speech synthesis. In general, we show that CGPWPM-based speech synthesis rates highly on the scale of absolute perceptual acceptability and that speech signals are faithfully reconstructed on a consistent basis, across speakers and genders. We applied CGPWPM to voice quality profiling and to a text-independent voice quality conversion method. The proposed voice conversion method achieves the desired perceptual effects, and the modified speech remains as natural-sounding and intelligible as natural speech. In this thesis, we also develop an optimal wavelet thresholding strategy for voice source signals which suppresses aspiration noise while retaining both the slow and the rapid variations in the voice source estimate.
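The wavelet thresholding strategy above is specific to the thesis, but the generic soft-thresholding step such strategies build on can be sketched as follows (a minimal illustration, not the thesis's optimal strategy; the function name is ours):

```python
def soft_threshold(coeffs, t):
    """Soft-threshold wavelet detail coefficients: shrink toward zero by t.

    Coefficients at or below the threshold (mostly noise) are zeroed; larger
    ones are kept but shrunk by t, which suppresses noise while retaining
    the signal structure carried by large coefficients.
    """
    return [0.0 if abs(c) <= t else (c - t if c > 0 else c + t)
            for c in coeffs]


# Small coefficients vanish, large ones survive slightly shrunk.
print(soft_threshold([0.25, -0.1, 2.0, -1.5], 0.5))
```
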
On the use of voice descriptors for glottal source shape parameter estimation
This paper summarizes the results of our investigations into estimating the shape of the glottal excitation source from speech signals. We employ the Liljencrants-Fant (LF) model describing the glottal flow and its derivative. The one-dimensional glottal source shape parameter Rd describes the transition in voice quality from a tense to a breathy voice. The parameter Rd has been derived from a statistical regression of the R waveshape parameters which parameterize the LF model. First, we introduce a variant of our recently proposed adaptation and range extension of the Rd parameter regression. Second, we discuss in detail the aspects of estimating the glottal source shape parameter Rd using the phase minimization paradigm. Based on the analysis of a large number of speech signals, we describe the major conditions that are likely to result in erroneous Rd estimates. Based on these findings, we investigate means to increase the robustness of the Rd parameter estimation. We use Viterbi smoothing to suppress unnatural jumps of the estimated Rd parameter contours within short time segments. Additionally, we propose to steer the Viterbi algorithm by exploiting the covariation of other voice descriptors to improve Viterbi smoothing. The novel Viterbi steering is based on a Gaussian Mixture Model (GMM) that represents the joint density of the voice descriptors and the Open Quotient (OQ) estimated from corresponding electroglottographic (EGG) signals. A conversion function derived from the mixture model predicts OQ from the voice descriptors. Converted to Rd, it defines an additional prior probability that adapts the partial probabilities of the Viterbi algorithm accordingly.
Finally, we evaluate the performance of the phase-minimization-based methods, using both variants to adapt and extend the Rd regression, on one synthetic test set, as well as in combination with Viterbi smoothing and each variant of the novel Viterbi steering on one test set of natural speech. The experimental findings show improvements for both Viterbi approaches.
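The Viterbi smoothing described above can be illustrated with a minimal sketch: one candidate Rd value is chosen per frame so as to minimize the local estimation cost plus a penalty on jumps between frames. This is a generic dynamic-programming illustration under our own simplified cost model, not the paper's implementation (which also folds in the GMM-based prior):

```python
def viterbi_smooth(frame_costs, jump_penalty):
    """Pick one candidate per frame minimizing local cost plus a penalty
    proportional to the jump between consecutive candidate indices.

    frame_costs: list of lists; frame_costs[t][s] is the cost of candidate
    s at frame t (e.g. a phase-minimization error for one Rd candidate).
    Returns the index sequence of the smoothed contour.
    """
    n_states = len(frame_costs[0])
    cost = list(frame_costs[0])   # cumulative best cost ending in each state
    back = []                     # backpointers per frame
    for t in range(1, len(frame_costs)):
        new_cost, ptrs = [], []
        for s in range(n_states):
            # Best predecessor: previous cumulative cost plus jump penalty.
            best_prev = min(range(n_states),
                            key=lambda p: cost[p] + jump_penalty * abs(p - s))
            ptrs.append(best_prev)
            new_cost.append(cost[best_prev]
                            + jump_penalty * abs(best_prev - s)
                            + frame_costs[t][s])
        cost, back = new_cost, back + [ptrs]
    # Backtrack from the cheapest final state.
    path = [min(range(n_states), key=cost.__getitem__)]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

With a high jump penalty an outlier frame is overridden; with a low penalty the locally best candidate wins, so the penalty controls how strongly unnatural jumps are suppressed.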
Comparison of laryngoscopic, glottal and vibratory parameters among Estill qualities – Case study
Estill Voice Training (EVT) is an effective educational system for developing and controlling distinct voice qualities used in contemporary commercial singing. EVT teaches six vocal qualities that differ at 13 levels. This study aims to investigate whether the distinct vocal qualities taught by EVT can be systematically differentiated based on laryngoscopic observations and vocal fold oscillation parameters. To investigate the differences between the six EVT qualities, laryngeal dimensions and glottal area waveform parameters were measured in a single female subject who performed them on a one-octave scale. Glottis Analysis Tools (GAT) was used to measure these parameters, and phonovibrograms were obtained from the analysis. The resulting data were subjected to factor analysis to identify systematic differences between the EVT qualities. High-speed videolaryngoscopy analysis revealed a significant influence of vocal quality on vocal fold oscillations. The factor analysis identified three factors based on laryngeal dimensions and four factors derived from GAT parameters. The first GAT factor was influenced by posterior adduction and distinguished belt quality from the other qualities, suggesting a significant influence of the aryepiglottic sphincter. The second GAT factor contained parameters derived from glottal length and amplitude, suggesting a relationship not only with vocal registers but also with laryngeal height. The third GAT factor was best related to the body-cover figure and phonation type (membranous medialization), while the fourth GAT factor was related to the amplitude-length ratio. These findings suggest that vocal fold oscillations can be used to distinguish between Estill voice qualities.
Analysis of human voice production using inverse filtering, high-speed imaging, and electroglottography
Human voice production was studied using three methods: inverse filtering, digital high-speed imaging of the vocal folds, and electroglottography. The primary goal was to evaluate an inverse filtering method by comparing inverse filtered glottal flow estimates with information obtained by the other methods. More detailed examination of the human voice source behavior was also included in the work.
Material from two experiments was analyzed in this study. The data of the first experiment consisted of simultaneous recordings of acoustic speech signal, electroglottogram, and high-speed imaging acquired during sustained vowel phonations. Inverse filtered glottal flow estimates were compared with glottal area waveforms derived from the image material by calculating pulse shape parameters from the signals. The material of the second experiment included recordings of acoustic speech signal and electroglottogram during phonations of sustained vowels. This material was utilized for the analysis of the opening phase and the closing phase of vocal fold vibration.
The evaluated inverse filtering method was found to produce mostly reasonable estimates of glottal flow. However, the parameters of the system have to be set appropriately, which requires experience in inverse filtering and speech production. The flow estimates often showed a two-stage opening phase with two instants of rapid increase in the flow derivative. The instant of glottal opening detected in the electroglottogram was often found to coincide with an increase in the flow derivative. The instant of minimum flow derivative was found to occur mostly during the last quarter of the closing phase, and it was shown to precede the closing peak of the differentiated electroglottogram.
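One of the pulse shape parameters commonly compared across glottal flow and glottal area waveforms is the open quotient. A minimal sketch, under the simplifying assumption that the glottis counts as open whenever the area sample exceeds a threshold (not necessarily the parameterization used in this study):

```python
def open_quotient(area, threshold=0.0):
    """Fraction of samples in one glottal cycle where the glottal area
    exceeds the threshold, i.e. the vocal folds are apart (glottis open)."""
    open_samples = sum(1 for a in area if a > threshold)
    return open_samples / len(area)


# One idealized cycle: closed for 5 of 8 samples, open for 3.
print(open_quotient([0, 0, 1, 2, 1, 0, 0, 0]))
```
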
Glottal Parameter Estimation by Wavelet Transform for Voice Biometry
Voice biometry is classically based mainly on the parameterization and patterning of speech features. The present approach is instead based on the characterization of phonation features (glottal features), with the intention of reducing intra-speaker variability due to the ‘text’. Through the study of larynx biomechanics it may be seen that the glottal correlates constitute a family of second-order Gaussian wavelets. The methodology relies on the extraction of the glottal correlates (the glottal source), which are parameterized using wavelet techniques. Classification and pattern matching were carried out using Gaussian Mixture Models. Data from speakers in a balanced database and from NIST SRE HASR2 were used in verification experiments. Preliminary results are given and discussed.
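The second-order Gaussian wavelet mentioned above is, up to sign and normalization, the second derivative of a Gaussian (the "Mexican hat"). A minimal sketch of that waveform, with our own unnormalized form:

```python
import math

def gaussian_wavelet_2nd(t, sigma=1.0):
    """Second-order Gaussian ("Mexican hat") wavelet, up to normalization:
    proportional to the negated second derivative of a Gaussian of width sigma.
    Peaks at t = 0, crosses zero at t = +/- sigma, then dips negative.
    """
    u = (t / sigma) ** 2
    return (1.0 - u) * math.exp(-u / 2.0)
```

The symmetric pulse-plus-sidelobes shape is what makes this family a plausible template for glottal source correlates.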
Voice source characterization for prosodic and spectral manipulation
The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing: speech synthesis, voice conversion and emotion detection, among others. Thus, we study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase.
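In the all-pole setting just described, inverse filtering amounts to computing the linear-prediction residual: passing speech through the inverse of the estimated vocal-tract filter recovers the excitation. A minimal sketch (a generic illustration, not the dissertation's algorithm):

```python
def all_pole_synth(e, a):
    """Drive the all-pole (vocal tract) filter with excitation e:
    x[n] = e[n] + sum_k a[k] * x[n-1-k]."""
    x = []
    for n in range(len(e)):
        x.append(e[n] + sum(a[k] * x[n - 1 - k]
                            for k in range(min(len(a), n))))
    return x


def inverse_filter(x, a):
    """Linear-prediction residual: e[n] = x[n] - sum_k a[k] * x[n-1-k].

    If x was produced by the all-pole filter with coefficients a, the
    residual recovers the original excitation exactly.
    """
    p = len(a)
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(min(p, n)))
            for n in range(len(x))]
```

In practice the coefficients `a` are estimated from the speech itself (e.g. by linear prediction), which is where the real difficulty of glottal inverse filtering lies.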
To validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method performs satisfactorily over a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to the reference, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking third, but still above the acceptance threshold (Fair-Good).
Next, we proposed two methods for prosody modification, one for each of the residual representations described above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. The second method used resampling of the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed to achieve quality levels similar to the reference methods.
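Resampling the residual waveform, as in the second method, can be illustrated with plain linear interpolation (a simplified sketch; the dissertation's frame selection and synthesis steps are not shown):

```python
def resample_linear(x, factor):
    """Resample x by `factor` using linear interpolation.

    A factor > 1 stretches the waveform; replayed at the original sample
    rate, a stretched residual cycle corresponds to a lower pitch.
    """
    n_out = int(len(x) * factor)
    out = []
    for m in range(n_out):
        pos = m / factor              # fractional position in the input
        i = int(pos)
        frac = pos - i
        nxt = x[i + 1] if i + 1 < len(x) else x[-1]
        out.append((1.0 - frac) * x[i] + frac * nxt)
    return out


# Stretching a ramp by 2 doubles its length and interpolates midpoints.
print(resample_linear([0, 2, 4, 6], 2))
```
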
As part of this dissertation, we studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. The results showed that the evaluators preferred our method over the original one, rating it higher on the MOS scale. To study voice quality, we recorded a small database of isolated, sustained Spanish vowels in four different phonation types (modal, rough, creaky and falsetto). Comparing the results with those reported in the literature, we found them to generally agree with previous findings; some differences existed, but they could be attributed to the difficulty of comparing voice qualities produced by different speakers. At the same time, we conducted experiments in voice quality identification, with very good results. We also evaluated the performance of an automatic emotion classifier based on GMMs using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The test results were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. Detection accuracy for the individual emotions was also high, improving on the results of previously reported work using the same database. Overall, we conclude that the glottal source parameters extracted with our algorithm have a positive impact in the field of automatic emotion classification.
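The ">20% relative error reduction" reported above compares classifier error rates as the fraction of the baseline's errors that the new system eliminates:

```python
def relative_error_reduction(baseline_err, system_err):
    """Fraction of the baseline's errors eliminated by the new system,
    e.g. 0.25 -> 0.19 error is a 24% relative reduction."""
    return (baseline_err - system_err) / baseline_err
```
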
Exposing the hidden vocal channel: Analysis of vocal expression
This dissertation explored perception and modeling of human vocal expression, and began by asking what people heard in expressive speech. To address this fundamental question, clips from Shakespearian soliloquy and from the Library of Congress Veterans Oral History Collection were presented to Mechanical Turk workers (10 per clip), who were asked to provide 1-3 keywords describing the vocal expression in the voice. The resulting keywords described prosody, voice quality, nonverbal quality, and emotion in the voice, along with the conversational style and personal qualities attributed to the speaker. More than half of the keywords described emotion, and these were wide-ranging and nuanced. In contrast, keywords describing prosody and voice quality reduced to a short list of frequently repeating vocal elements. Given this description of perceived vocal expression, a 3-step process was used to model the vocal qualities which listeners most frequently perceived. This process included 1) an interactive analysis across each condition to discover its distinguishing characteristics, 2) feature selection and evaluation via unequal variance sensitivity measurements and examination of means and 2-sigma variances across conditions, and 3) iterative, incremental classifier training and validation. The resulting models performed at 2-3.5 times chance. More importantly, the analysis revealed a continuum relationship across whispering, breathiness, modal speech, and resonance, and revealed multiple spectral sub-types of breathiness, modal speech, resonance, and creaky voice. Finally, latent semantic analysis (LSA) applied to the crowdsourced keyword descriptors enabled organic discovery of the expressive dimensions present in each corpus, and revealed relationships among perceived voice qualities and emotions within each dimension and across the corpora.
The resulting dimensional classifiers performed at up to 3 times chance. A second study presented a dimensional analysis of laughter. This research produced a new way of exploring emotion in the voice, and of examining relationships among emotion, prosody, voice quality, conversation quality, personal quality, and other expressive vocal elements. For future work, this perception-grounded fusion of crowdsourcing and LSA can be applied to anything humans can describe, in any research domain.
Hidden Markov model based Finnish text-to-speech system utilizing glottal inverse filtering
In this work, a new hidden Markov model (HMM) based text-to-speech (TTS) system utilizing glottal inverse filtering is described. The primary goal of the new TTS system is to enable producing natural-sounding synthetic speech in different speaking styles with different speaker characteristics and emotions. In order to achieve these goals, the function of the real human voice production mechanism is modeled with the help of glottal inverse filtering embedded in a statistical HMM framework.
The new TTS system uses a glottal inverse filtering based parametrization method that enables the extraction of voice source characteristics separate from other speech parameters, and thus the individual modeling of these characteristics in the HMM system. In the synthesis stage, natural glottal flow pulses are used for creating the voice source, and the voice source characteristics are further modified according to the adaptive all-pole model generated by the HMM system in order to imitate the natural variation in the real voice source.
Subjective listening tests show that the quality of the new TTS system is considerably better than that of a traditional HMM-based speech synthesizer. Moreover, the new system is clearly able to produce natural-sounding synthetic speech with specific speaker characteristics.
Voice quality features in the production of pharyngeal consonants by Iraqi Arabic speakers
PhD thesis. This study investigates nasalisation and laryngealisation in the production of pharyngeal consonants in Iraqi Arabic (IA), and as potential voice quality (VQ) settings of IA speakers in general. Pharyngeal consonants have been the subject of investigation in many studies on Arabic, primarily due to the wide range of variation in their realisation across dialects, including approximant, fricative, and stop variants. This is the first quantitative study of its kind to extend these findings to IA and to investigate whether any of the variants and/or VQ features are dialect-specific.
The study offers a detailed auditory and acoustic account of the realisations of pharyngeal consonants as produced by nine male speakers of three Iraqi dialects: Baghdad (representing Central gelet), Basra (representing Southern gelet) and Mosul (representing Northern qeltu) (Blanc, 1964; Ingham, 1997). Acoustic cues to nasalisation and phonation types are investigated in isolated vowels and in oral, nasal, and pharyngeal environments in order to unravel the source of the nasalised and laryngealised VQ percept and to establish whether their manifestations are categorical or particular to certain contexts.
Results suggest a range of realisations for the pharyngeals that are conditioned by word position and dialect. Regardless of realisation, VQ measurements suggest that: (1) nasalisation increases when pharyngeals are adjacent to nasals, beyond what is expected of a nasal environment; (2) vowels neighbouring pharyngeals show more nasalisation than in oral environments; (3) vowels in pharyngeal contexts and in isolation show more laryngealisation compared with nasal and oral contexts; (4) both nasals and pharyngeals show a progressive effect of nasalisation, and pharyngeals show a progressive effect of laryngealisation; (5) /ħ/ shows a greater nasalisation but lesser laryngealisation effect on neighbouring vowels than /ʕ/; and (6) Baghdad speech is the most nasalised and laryngealised and Basra speech the least. These results coincide with observations on Muslim Baghdadi gelet having a guttural quality (Bellem, 2007). The study reveals that the overall percept of a nasalised and laryngealised VQ in IA is a local feature rather than a general vocal setting.