    Changes in Aerodynamic Function and Closed Quotient with the Variable Pitch and Loudness in Male Classic Singers

    This study examined aerodynamic function (mean airflow rate, MFR; subglottal pressure, Psub) and closed quotient (CQ) in five male classical singers (baritones) under two conditions: fixed pitch (C3, E3, G3, C4) with variable loudness (70 and 80 dB), and fixed loudness (70 dB and 80 dB) with variable pitch (C3, E3, G3, C4). Results showed that MFR significantly increased at C3, E3, and G3 and Psub significantly increased at C4 when the loudness increased from 70 to 80 dB. At 70 dB, MFR and Psub significantly increased and CQ significantly decreased when the pitch increased from C3 to C4. At 80 dB, MFR significantly decreased when the pitch increased from C3 to G3. However, Psub showed a significant decrease as the pitch increased at 80 dB. In conclusion, as loudness increases, aerodynamic loss becomes higher and vocal efficiency lower at low pitch than at higher pitch. At a low loudness level, the main mechanism for controlling loudness is the amount of medial compression of the vocal folds rather than aerodynamic function. In addition, both aerodynamic function and medial compression of the vocal folds play a significant role in increasing the loudness level.

    On the use of voice descriptors for glottal source shape parameter estimation

    This paper summarizes the results of our investigations into estimating the shape of the glottal excitation source from speech signals. We employ the Liljencrants-Fant (LF) model describing the glottal flow and its derivative. The one-dimensional glottal source shape parameter Rd describes the transition in voice quality from a tense to a breathy voice. The parameter Rd has been derived from a statistical regression of the R waveshape parameters which parameterize the LF model. First, we introduce a variant of our recently proposed adaptation and range extension of the Rd parameter regression. Secondly, we discuss in detail the aspects of estimating the glottal source shape parameter Rd using the phase minimization paradigm. Based on the analysis of a large number of speech signals, we describe the major conditions that are likely to result in erroneous Rd estimates. Based on these findings, we investigate means to increase the robustness of the Rd parameter estimation. We use Viterbi smoothing to suppress unnatural jumps of the estimated Rd parameter contours within short time segments. Additionally, we propose to steer the Viterbi algorithm by exploiting the covariation of other voice descriptors to improve Viterbi smoothing. The novel Viterbi steering is based on a Gaussian Mixture Model (GMM) that represents the joint density of the voice descriptors and the Open Quotient (OQ) estimated from corresponding electroglottographic (EGG) signals. A conversion function derived from the mixture model predicts OQ from the voice descriptors. Converted to Rd, it defines an additional prior probability used to adapt the partial probabilities of the Viterbi algorithm accordingly. 
Finally, we evaluate the performance of the phase minimization based methods, using both variants to adapt and extend the Rd regression, on one synthetic test set, as well as in combination with Viterbi smoothing and each variant of the novel Viterbi steering on one test set of natural speech. The experimental findings exhibit improvements for both Viterbi approaches.
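
The Viterbi smoothing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate grid, the local cost values, and the `jump_weight` parameter are all assumptions made for illustration, and the GMM-based steering prior is left out. Each frame offers a few candidate Rd values with a local cost (lower is better), and the dynamic program picks the path that balances local cost against jumps between neighbouring frames.

```python
def viterbi_smooth(candidates, costs, jump_weight=1.0):
    """Choose one candidate per frame, minimizing local cost plus jump penalty.

    candidates[t]: candidate Rd values for frame t (hypothetical grid)
    costs[t]:      local cost of each candidate, lower is better
                   (assumed to come from some per-frame estimator)
    jump_weight:   hypothetical weight on |Rd change| between frames
    """
    n_frames = len(candidates)
    # best[t][j]: minimal accumulated cost of any path ending in candidate j
    best = [list(costs[0])]
    # back[t][j]: index of the predecessor candidate on that best path
    back = [[-1] * len(candidates[0])]
    for t in range(1, n_frames):
        row, ptr = [], []
        for j, rd in enumerate(candidates[t]):
            totals = [
                best[t - 1][i] + jump_weight * abs(rd - prev)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(totals)), key=totals.__getitem__)
            row.append(totals[i_min] + costs[t][j])
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)
    # Backtrack from the cheapest final state.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for t in range(n_frames - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]


# Toy data: frame 2 has a slightly cheaper outlier at Rd = 2.5.
grid = [0.5, 1.0, 2.5]
candidates = [grid] * 4
costs = [[0.9, 0.1, 0.8], [0.9, 0.1, 0.8], [0.9, 0.3, 0.2], [0.9, 0.1, 0.8]]
smoothed = viterbi_smooth(candidates, costs, jump_weight=1.0)
framewise = viterbi_smooth(candidates, costs, jump_weight=0.0)
```

With `jump_weight=0.0` the result reduces to the frame-wise minimum and keeps the outlier in frame 2; with a positive weight the isolated jump is suppressed. A steering prior of the kind the paper proposes could, under the same assumptions, be added as an extra term in the local cost.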

    Phonetic characteristics of persuasion in Finnish speech culture

    Goals: The aim of this study was to examine a possible correlation between a speaker's voice quality and his or her persuasiveness in a speech performance. A further aim was to find out which perceived speaker characteristics correlate with different voice qualities in Finnish speech culture. The features of Finnish speech culture have not been studied extensively from the perspective of voice quality. Voice quality is known to convey important information about the speaker and to affect how speech is received by listeners, which is why it was chosen here as the feature through which persuasiveness is examined. Studying persuasiveness through voice quality and with phonetic methods offers new information not only for phonetics but also for fields such as speech communication and cultural studies. 
Methods and material: This qualitative study is based on spontaneous speech samples from 12 Finnish speakers and on perceptual evaluations by 30 listeners. The data were examined using auditory and acoustic methods. The auditory analysis is based on a perception test completed by 30 listeners who had grown up in Finnish speech culture; listeners of different ages, genders, and educational backgrounds were selected, and their responses were collected with an electronic form. The listeners assessed the speakers' voice quality, educational background, personal characteristics, and convincingness. Convincingness was measured on a Likert scale. Voice quality in the spontaneous speech was also measured acoustically with the glottal inverse filtering program Aalto Aparat. The voice quality parameter was the normalized amplitude quotient (NAQ), which was compared with the Likert-scale ratings of speaker convincingness. The relationship between these two measures was tested with the Kruskal-Wallis test. All results are presented in statistical tables. Results and conclusions: The voice qualities modal, creaky, hoarse, and breathy were detected in the Finnish speakers. The most frequently perceived speaker characteristics were genuine, monotonous, enthusiastic, trustworthy, analytical, credible, and familiar. The prevalence of these characteristics can be seen as reflecting Finnish speech culture, which values trustworthiness, clarity, and genuineness. The results show that tense voice production was associated with clear, modal voice quality and with speakers perceived as convincing. Speakers rated as convincing also had creak in their voice in addition to modal voice. Although the acoustic parameter NAQ did not correlate with the Likert-scale ratings of convincingness in this data, the acoustic results support the role of voice quality as an independent feature of speech. The auditory analysis gave important information on the interaction between perceived speaker traits and voice quality. 
Because of the small size of the data set, no clear correlation between voice quality and speaker convincingness can be established from this study, but the results indicate how different voice qualities are used in speech that aims to be convincing in Finnish speech culture.
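
For readers unfamiliar with the acoustic parameter, the following is a minimal sketch of how a normalized amplitude quotient can be computed for one glottal cycle, using the commonly cited definition NAQ = f_ac / (d_peak · T), where f_ac is the peak-to-peak flow amplitude, d_peak the magnitude of the negative peak of the flow derivative, and T the period length. It is not the Aalto Aparat implementation; the function name, the first-difference derivative, and the raised-cosine test pulse are assumptions for illustration only.

```python
import math


def naq(flow, fs, f0):
    """NAQ of a single glottal flow cycle (illustrative sketch).

    flow: samples of one glottal flow cycle (assumed already obtained
          by inverse filtering)
    fs:   sampling rate in Hz
    f0:   fundamental frequency in Hz
    """
    f_ac = max(flow) - min(flow)  # peak-to-peak (AC) flow amplitude
    # First difference scaled by fs approximates the time derivative.
    deriv = [(b - a) * fs for a, b in zip(flow, flow[1:])]
    d_peak = -min(deriv)  # magnitude of the negative derivative peak
    period = 1.0 / f0
    return f_ac / (d_peak * period)


# Hypothetical test pulse: one raised-cosine cycle at 100 Hz.
fs, f0 = 100_000, 100
n = fs // f0
pulse = [0.5 * (1.0 - math.cos(2.0 * math.pi * k / n)) for k in range(n + 1)]
value = naq(pulse, fs, f0)  # close to 1/pi for this symmetric pulse
```

For the symmetric raised-cosine pulse the analytic value is 1/π. A pulse with a more abrupt closing phase has a larger derivative peak and therefore a smaller NAQ.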

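    Äänen arviointi uusien riskiryhmien keskuudessa : tutkimuskohteena orgaaniselle pölylle altistuvat työntekijät, lastentarhan opettajat sekä lapset, joille on tehty kurkunpään leikkauksia [Voice assessment among new risk groups: workers exposed to organic dust, kindergarten teachers, and children who have undergone laryngeal surgery]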

    Over the last decades, school teachers and singers have been more or less the focus of voice research, due to their specific occupational needs. Other population groups are now starting to draw attention as well. These new groups include workers exposed to organic dusts in various workplaces, with possible laryngeal reactions. The second group includes children operated on for subglottic stenosis, with possible effects on voice and related quality of life. The third group is nursery teachers, who have been insufficiently studied through field research for possible voice problems. This thesis aims to shed light on these newly emerging vulnerable groups by assessing their voices through questionnaires, perceptual and acoustic voice assessment, and videolaryngoscopic examination. The thesis includes four studies. Nine subjects with suspected occupational rhinitis or asthma participated in Studies I and II. They had single-blinded exposures to organic dust and placebo substances. Subjective and perceptual voice assessment was done in addition to acoustic analysis of 180 samples using glottal inverse filtering. In Study III, children's voices were perceptually assessed, as well as their health- and voice-related quality of life. In Study IV, 119 female kindergarten teachers responded to a questionnaire on voice habits, voice symptoms, and the impact of various working conditions on voice. In addition, videolaryngoscopy examinations took place in these teachers' workplaces. Studies I and II showed that some self-assessed voice and throat symptoms changed significantly after organic dust exposure, although perceptual assessment failed to record these changes. However, inverse filtering analysis revealed changes that correspond to the ones reported by the subjects. In Study III, voice-related quality of life and perceptual assessment of the study group showed lower scores than those of the controls. Study IV showed that 71.5% of the teachers examined reported frequent strain on the voice. 
Organic findings were observed in 10.9% of the subjects and did not correlate with subjective voice symptoms. The thesis added new information on these high-risk groups, identifying an occupational voice-disorder risk group related to laryngeal reactions rather than voice abuse. It also added information on the long-term effects of surgery for subglottic stenosis in early infancy. Nevertheless, field videolaryngoscopy was quite accurate in determining the percentage of organic findings among nursery teachers. The voice is humanity's most important means of communication, used continuously from morning to evening. As much as a third of society's workforce, including teachers, singers, and lawyers, works in voice professions. In recent decades teachers and singers have been the focus of voice research, but other population groups that may suffer from voice problems are beginning to attract researchers' attention. The starting point of this study was an interest in these new risk groups and in the assessment of their voice problems. Three new risk groups were studied for the thesis: 1) workers in whom exposure to organic dust causes voice problems, 2) children who underwent laryngeal surgery in early childhood for subglottic stenosis, and 3) kindergarten teachers, who have not been studied sufficiently in the field, that is, at their workplaces. The studies used a variety of voice research methods: common ones such as questionnaires, but also rarer and in part very new ones, such as acoustic analysis by inverse filtering. Inverse filtering was used to measure voice production at the level of the vocal folds. Laryngoscopy was performed with a portable video system at the kindergarten teachers' workplaces. The thesis consists of four articles published in international journals, which cover the studies of the three risk groups mentioned above. 
Workers exposed to organic dust may suffer from a work-related voice disorder caused by a laryngeal reaction rather than by voice use. The workers are aware that their voice has changed, yet the change goes unnoticed in perceptual assessment, that is, to the clinicians' ears. Acoustic analysis by inverse filtering detected these changes. Acoustic detection of voice changes that go unnoticed by ear is a very rare and promising result. The second group studied consisted of children operated on for subglottic stenosis. The study found that, in the long term, surgical treatment impairs perceptual voice quality compared with healthy children. In addition, voice-related quality of life is lower in the operated children. The third risk group was kindergarten teachers. The study showed that 71.5% of the teachers examined reported frequent voice strain. In addition, organic changes of the vocal folds were found in 10.9% of the teachers. The thesis research has produced significant information about the new risk groups.

    Voice source characterization for prosodic and spectral manipulation

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on the glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing: speech synthesis, voice conversion or emotion detection, among others. Thus, we will study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds goes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase. In order to validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method gives satisfactory results in a wide range of glottal configurations and at different levels of SNR. Our method using the whitened residual compared favorably to this reference, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking in third place, but still higher than the acceptance threshold (Fair-Good). Next we proposed two methods for prosody modification, one for each of the residual representations explained above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. 
The second method used resampling on the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed in order to achieve quality levels similar to the reference methods. As part of this dissertation, we have studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We have included our speech production model in a reference voice conversion system, to evaluate the impact of our parametrization in this task. The results showed that the evaluators preferred our method over the original one, rating it with a higher score on the MOS scale. To study voice quality, we recorded a small database consisting of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky and falsetto), which were later also used in our study of voice quality. Comparing the results with those reported in the literature, we found them to generally agree with previous findings. Some differences existed, but they could be attributed to the difficulties in comparing voice qualities produced by different speakers. At the same time we conducted experiments in the field of voice quality identification, with very good results. We have also evaluated the performance of an automatic emotion classifier based on GMM using glottal measures. For each emotion, we have trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. The detection accuracy for the individual emotions was also high, improving on the results of previously reported works using the same database. 
Overall, we can conclude that the glottal source parameters extracted using our algorithm have a positive impact in the field of automatic emotion classification.
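
The inverse filtering idea mentioned in this abstract can be sketched in its simplest textbook form. The dissertation itself embeds a parametric glottal model in the decomposition; the sketch below instead uses plain linear prediction (autocorrelation method with the Levinson-Durbin recursion) and treats the vocal tract as an all-pole filter. The function names, the second-order toy filter, and the single-impulse excitation are assumptions for illustration.

```python
def synthesize(a, excitation):
    """All-pole filter: x[n] = e[n] - a[1]*x[n-1] - ... - a[p]*x[n-p]."""
    x = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * x[n - k]
        x.append(acc)
    return x


def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of the finite sequence x."""
    return [
        sum(x[n] * x[n + k] for n in range(len(x) - k))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns a[0..order] with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def inverse_filter(x, a):
    """Apply A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p to x (the residual)."""
    return [
        sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
        for n in range(len(x))
    ]


# Toy example: a hypothetical 2-pole "vocal tract" excited by one impulse.
true_a = [1.0, -1.2, 0.72]
excitation = [1.0] + [0.0] * 399
speech = synthesize(true_a, excitation)
est_a = levinson_durbin(autocorr(speech, 2), 2)
residual = inverse_filter(speech, est_a)  # approximately the impulse again
```

On this idealized signal the estimated coefficients match the true filter and the residual recovers the excitation. Real speech additionally involves the glottal spectral tilt and lip radiation, which this sketch deliberately ignores.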

    A novel framework for high-quality voice source analysis and synthesis

    The analysis, parameterization and modeling of voice source estimates obtained via inverse filtering of recorded speech are some of the most challenging areas of speech processing, owing to the fact that humans produce a wide range of voice source realizations and that the voice source estimates commonly contain artifacts due to the non-linear time-varying source-filter coupling. Currently, the most widely adopted representation of the voice source signal is the Liljencrants-Fant (LF) model, which was developed in late 1985. Due to its overly simplistic interpretation of voice source dynamics, the LF model cannot represent the fine temporal structure of glottal flow derivative realizations, nor can it carry sufficient spectral richness to facilitate truly natural sounding speech synthesis. In this thesis we have introduced Characteristic Glottal Pulse Waveform Parameterization and Modeling (CGPWPM), which constitutes an entirely novel framework for voice source analysis, parameterization and reconstruction. In a comparative evaluation of CGPWPM and the LF model we have demonstrated that the proposed method is able to preserve higher levels of speaker-dependent information from the voice source estimates and realize more natural sounding speech synthesis. In general, we have shown that CGPWPM-based speech synthesis rates highly on the scale of absolute perceptual acceptability and that speech signals are faithfully reconstructed on a consistent basis across speakers and genders. We have applied CGPWPM to voice quality profiling and to a text-independent voice quality conversion method. The proposed voice conversion method is able to achieve the desired perceptual effects, and the modified speech remained as natural sounding and intelligible as natural speech. 
In this thesis, we have also developed an optimal wavelet thresholding strategy for voice source signals which is able to suppress aspiration noise and still retain both the slow and the rapid variations in the voice source estimate.