Gender and vocal production mode discrimination using the high frequencies for speech and singing
Humans routinely produce acoustical energy at frequencies above 6 kHz during vocalization, but this frequency range is often not represented in communication devices and speech perception research. Recent advancements toward high-definition (HD) voice and extended-bandwidth hearing aids have increased interest in the high frequencies. The potential perceptual information provided by high-frequency energy (HFE) is not well characterized. We found that humans can accomplish tasks of gender discrimination and vocal production mode discrimination (speech vs. singing) when presented with acoustic stimuli containing only HFE, at both amplified and normal levels. Performance in these tasks was robust in the presence of low-frequency masking noise. No substantial learning effect was observed. Listeners also were able to identify the sung and spoken text (excerpts from "The Star-Spangled Banner") with very few exposures. These results add to the increasing evidence that the high frequencies provide at least redundant information about the vocal signal, suggesting that their representation in communication devices (e.g., cell phones, hearing aids, and cochlear implants) and speech/voice synthesizers could improve these devices and benefit normal-hearing and hearing-impaired listeners.
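The kind of HFE-only stimulus this abstract describes can be approximated with a simple high-pass filter that discards energy below roughly 6 kHz. The sketch below is an illustration of that idea using SciPy, not the authors' actual stimulus-generation procedure; the cutoff, filter order, and test tones are assumed values.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def isolate_hfe(signal, fs, cutoff_hz=6000.0, order=8):
    """High-pass filter a signal, keeping only energy above cutoff_hz."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos, signal)

# Example: a tone below the cutoff is strongly attenuated,
# while one above the cutoff passes through largely unchanged.
fs = 44100
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 1000 * t)   # 1 kHz: below the 6 kHz cutoff
high = np.sin(2 * np.pi * 8000 * t)  # 8 kHz: above the cutoff

low_out = isolate_hfe(low, fs)
high_out = isolate_hfe(high, fs)
print(np.sqrt(np.mean(low_out**2)), np.sqrt(np.mean(high_out**2)))
```

In practice, a masking noise band below the cutoff (as used in the study) would further ensure that listeners cannot exploit residual low-frequency energy.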
The perceptual significance of high-frequency energy in the human voice
While human vocalizations generate acoustical energy at frequencies up to (and beyond) 20 kHz, the energy at frequencies above about 5 kHz has traditionally been neglected in speech perception research. The intent of this paper is to review (1) the historical reasons for this research trend and (2) the work that continues to elucidate the perceptual significance of high-frequency energy (HFE) in speech and singing. The historical and physical factors reveal that, while HFE was believed to be unnecessary and/or impractical for applications of interest, it was never shown to be perceptually insignificant. Rather, the main causes of the focus on low-frequency energy appear to be that the low-frequency portion of the speech spectrum was seen as sufficient (from a perceptual standpoint) and that the difficulty of HFE research was too great to be justifiable (from a technological standpoint). The advancement of technology continues to overcome concerns stemming from the latter cause. Likewise, advances in our understanding of the perceptual effects of HFE now cast doubt on the former. Emerging evidence indicates that HFE plays a more significant role than previously believed and should thus be considered in speech and voice perception research, especially research involving children and the hearing impaired.
An approach to explaining formants (Story, 2024)
Purpose: This tutorial is a description of a possible approach to teaching the concept of formants to students in a speech science course, at either the undergraduate or graduate level. The approach is to explain formants as prominent regions of energy in the output spectrum envelope radiated at the lips, and how they arise as the superposition of vocal tract resonances on a source signal. Standing waves associated with vocal tract resonances are briefly explained, and standing wave animations are provided. Animations of the temporal variation of the vocal tract, vocal tract resonances, spectra, and spectrograms, along with audio samples, are included to provide dynamic demonstrations of the concept of formants.

Conclusions: The explanations, accompanying demonstrations, and suggested activities are intended to provide a launching point for understanding formants and how they can be measured, analyzed, and interpreted. As a result, participants should be able to describe the meaning of the term "formant" as it relates to a spectrum and a spectrogram, explain the difference between formants and vocal tract resonances, explain how vocal tract resonances combined with the voice source generate formants, and identify formants in both narrow-band and wide-band spectrograms and track their time-varying patterns with a formant tracking algorithm.

Supplemental Material S1. Standing wave in neutral vocal tract configuration for the first resonance.
Supplemental Material S2. Standing wave in neutral vocal tract configuration for the second resonance.
Supplemental Material S3. Standing wave in neutral vocal tract configuration for the third resonance.
Supplemental Material S4. Pressure distribution in neutral vocal tract configuration at 1000 Hz, off resonance.
Supplemental Material S5. Animation of the temporal variation of the components of the source-filter representation during production of "Hello, how are you." The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
Supplemental Material S6. Audio file containing the real-time voice source signal (glottal flow wave) generated during the TubeTalker simulation of "Hello, how are you."
Supplemental Material S7. Audio file containing the real-time output pressure signal generated during the TubeTalker simulation of "Hello, how are you."
Supplemental Material S8. Animation of the temporal variation of the vocal tract in two representations during production of "Hello, how are you." In the upper inset plot, the vocal tract is shown in tubular form, and in the main plot in the middle the vocal tract is shown in a pseudo-midsagittal form. The lower inset plot shows the simultaneous temporal variation of the frequency response function (resonances). The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
Supplemental Material S9. Animation of the temporal variation of the frequency response function in three dimensions (time, frequency, amplitude) during production of "Hello, how are you." There is a delay in the middle of the animation to allow the viewer to see the full history, and then the view rotates into a traditional spectrographic perspective. The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.
Supplemental Material S10. Animation of the temporal variation of narrow-band spectra in three dimensions (time, frequency, amplitude) during production of "Hello, how are you." There is a delay in the middle of the animation to allow the viewer to see the full history, and then the view rotates into a traditional spectrographic perspective. The animation also includes an audio track that is a slowed version of the phrase generated by the TubeTalker model.

Story, B. H. (2024). An approach to explaining formants. Perspectives of the ASHA Special Interest Groups. Advance online publication. https://doi.org/10.1044/2023_PERSP-23-00200
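The source-filter idea this tutorial teaches can be sketched in a few lines: a periodic impulse train stands in for the glottal source, and a cascade of second-order resonators imposes vocal tract resonances, producing formant peaks in the output spectrum. This is a minimal illustration, not the TubeTalker model; the sample rate, fundamental frequency, resonance frequencies, and bandwidths below are assumed values typical of a neutral vowel.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                     # sample rate (Hz), assumed
f0 = 120                       # source fundamental frequency (Hz), assumed
formants = [500, 1500, 2500]   # neutral-tract resonance frequencies (Hz)
bandwidths = [80, 90, 120]     # resonance bandwidths (Hz), assumed

# Source: an impulse train standing in for the glottal flow derivative.
n = fs // 2                    # half a second of samples
source = np.zeros(n)
source[:: fs // f0] = 1.0

# Filter: a cascade of second-order resonators, one per resonance.
signal = source
for f, bw in zip(formants, bandwidths):
    r = np.exp(-np.pi * bw / fs)        # pole radius set by the bandwidth
    theta = 2 * np.pi * f / fs          # pole angle set by the frequency
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    signal = lfilter([1.0], a, signal)

# The output spectrum shows formants: energy prominences near 500, 1500,
# and 2500 Hz superimposed on the source's harmonic structure.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(n, 1 / fs)
```

Note that the formant peaks land on source harmonics near the resonance frequencies rather than exactly on them, which is one way to demonstrate the tutorial's distinction between formants (spectral prominences) and vocal tract resonances (filter properties).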
A model of speech production based on the acoustic relativity of the vocal tract
A model is described in which the effects of articulatory movements to produce speech are generated by specifying relative acoustic events along a time axis. These events consist of directional changes of the vocal tract resonance frequencies that, when associated with a temporal event function, are transformed via acoustic sensitivity functions into time-varying modulations of the vocal tract shape. Because the time courses of the events may be considerably overlapped in time, coarticulatory effects are automatically generated. Production of sentence-level speech with the model is demonstrated with audio samples and vocal tract animations. (C) 2019 Acoustical Society of America. 6 month embargo; published online: 17 October 2019. This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at [email protected].
An acoustically-driven vocal tract model for stop consonant production
The purpose of this study was to further develop a multi-tier model of the vocal tract area function in which the modulations of shape to produce speech are generated by the product of a vowel substrate and a consonant superposition function. The new approach consists of specifying input parameters for a target consonant as a set of directional changes in the resonance frequencies of the vowel substrate. Using calculations of acoustic sensitivity functions, these "resonance deflection patterns" are transformed into time-varying deformations of the vocal tract shape without any direct specification of the location or extent of the consonant constriction along the vocal tract. The configurations of the constrictions and expansions generated by this process were shown to be physiologically realistic and to produce speech sounds that are easily identifiable as the target consonants. This model is a useful enhancement for area function-based synthesis and can serve as a tool for understanding how the vocal tract is shaped by a talker during speech production. (C) 2016 Elsevier B.V. All rights reserved. NIH [R01-DC011275]; NSF [BCS-1145011]. 24 month embargo; available online 9 December 2016.
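The mapping from resonance deflections to an area deformation can be sketched as a small linear-algebra problem: given sensitivity functions relating fractional area change to fractional resonance change, choose the minimum-norm area deformation that produces the requested deflections. The cosine shapes below are stand-ins for illustration only; they are not the sensitivity functions computed by the model, and the section count and deflection targets are assumed values.

```python
import numpy as np

num_sections = 44                       # tubelet sections, glottis to lips (assumed)
x = (np.arange(num_sections) + 0.5) / num_sections

# One assumed "sensitivity function" per resonance (rows of S); real
# sensitivity functions would be computed from the vocal tract acoustics.
S = np.vstack([np.cos((2 * n - 1) * np.pi * x) for n in (1, 2, 3)])

# Desired fractional resonance deflections: lower R1, raise R2, hold R3.
d = np.array([-0.05, +0.08, 0.0])

# Minimum-norm fractional area deformation satisfying S @ delta = d.
w = np.linalg.solve(S @ S.T, d)
delta = S.T @ w                         # delta[i] ~ dA_i / A_i per section

print(np.round(S @ delta, 6))           # recovers the requested deflections
```

The appeal of this formulation, as the abstract notes, is that the location and extent of the resulting constriction fall out of the computation rather than being specified directly.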
The relation of velopharyngeal coupling area to the identification of stop versus nasal consonants in North American English based on speech generated by acoustically driven vocal tract modulations
The purpose of this study was to determine the threshold of velopharyngeal coupling area at which listeners switch from identifying a consonant as a stop to a nasal in North American English, based on V1CV2 stimuli generated with a speech production model that encodes phonetic segments as relative acoustic targets. Each V1CV2 was synthesized with a set of velopharyngeal coupling functions whose area ranged from 0 to 0.1 cm². Results show that consonants were identified by listeners as a stop when the coupling area was less than 0.035-0.057 cm², depending on place of articulation and final vowel. The smallest coupling area (0.035 cm²) at which the stop-to-nasal switch occurred was found for an alveolar consonant in the /aCi/ context, whereas the largest (0.057 cm²) was for a bilabial in /aCa/. For each stimulus, the balance of oral versus nasal acoustic energy was characterized by the peak nasalance during the consonant. Stimuli with peak nasalance below 40% were mostly identified by listeners as stops, whereas those above 40% were identified as nasals. This study was intended to be a precursor to further investigations using the same model but scaled to represent the developing speech production system of male and female talkers. 6 month embargo; published online: 16 November 2021.
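The nasalance criterion in this abstract can be expressed compactly: nasalance is nasal acoustic energy as a percentage of total (nasal plus oral) energy, and the 40% peak-nasalance boundary separates the two percept classes. The helper functions below are a hypothetical sketch of that decision rule, not code from the study, and how the two energy terms are measured is left unspecified.

```python
def nasalance_percent(nasal_energy, oral_energy):
    """Nasalance: nasal acoustic energy as a percentage of total energy."""
    return 100.0 * nasal_energy / (nasal_energy + oral_energy)

def classify_consonant(peak_nasalance, threshold=40.0):
    """Label a consonant by its peak nasalance during the closure,
    using the 40% boundary reported in the abstract."""
    return "nasal" if peak_nasalance > threshold else "stop"

print(classify_consonant(nasalance_percent(0.2, 0.8)))  # 20% -> "stop"
print(classify_consonant(nasalance_percent(0.7, 0.3)))  # 70% -> "nasal"
```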
The relation of velopharyngeal coupling area and vocal tract scaling to identification of stop-nasal cognates
The purpose of this study was to determine whether the threshold of velopharyngeal (VP) coupling area at which listeners switch from identifying a consonant as a stop to a nasal in North American English was different for speech produced by a model based on an adult male, an adult female, and a 4-year-old child. V1CV2 stimuli were generated with a speech production model that encodes phonetic segments as relative acoustic targets imposed on an underlying vocal tract and laryngeal structure that can be scaled according to sex and age. Each V1CV2 was synthesized with a set of VP coupling functions whose maximum area ranged from 0 to 0.1 cm². Results showed that scaling the vocal tract and vocal folds had essentially no effect on the VP coupling area at which listener identification shifted from stop to nasal. The range of coupling areas at which the crossover occurred was 0.037-0.049 cm² for the male model, 0.040-0.055 cm² for the female model, and 0.039-0.052 cm² for the 4-year-old child model; the overall mean was 0.044 cm². Calculations of band-limited peak nasalance indicated that 85% peak nasalance during the consonant was well aligned with listener responses. 6 month embargo; first published 15 December 2023.
Influence of Left–Right Asymmetries on Voice Quality in Simulated Paramedian Vocal Fold Paralysis
Purpose: The purpose of this study was to determine the vocal fold structural and vibratory symmetries that are important to vocal function and voice quality in a simulated paramedian vocal fold paralysis. Method: A computational kinematic speech production model was used to simulate an exemplar "voice" on the basis of asymmetric settings of parameters controlling glottal configuration. These parameters were then altered individually to determine their effect on maximum flow declination rate, spectral slope, cepstral peak prominence, harmonics-to-noise ratio, and perceived voice quality. Results: Asymmetry of each of the 5 vocal fold parameters influenced vocal function and voice quality; measured change was greatest for adduction and bulging. Increasing the symmetry of all parameters improved voice, and the best voice occurred with overcorrection of adduction, followed by bulging, nodal point ratio, starting phase, and amplitude of vibration. Conclusions: Although vocal process adduction and edge bulging asymmetries are most influential in voice quality for simulated vocal fold motion impairment, amplitude of vibration and starting phase asymmetries are also perceptually important. These findings are consistent with the current surgical approach to vocal fold motion impairment, where goals include medializing the vocal process and straightening concave edges. The results also explain many of the residual postoperative voice limitations.
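One of the outcome measures listed above, cepstral peak prominence (CPP), can be sketched in a few lines: take the cepstrum of the dB magnitude spectrum, find the peak in the quefrency band of plausible pitch periods, and measure it relative to a regression-line baseline over that band. This is a simplified illustration of the general technique, not the study's own computation, and the pitch search range and framing choices are assumed.

```python
import numpy as np

def cepstral_peak_prominence(x, fs, f0_range=(60.0, 300.0)):
    """Simplified cepstral peak prominence: cepstral peak height above a
    linear-regression baseline, within the quefrency band of plausible
    pitch periods (f0_range in Hz)."""
    log_spec_db = 20.0 * np.log10(np.abs(np.fft.fft(x)) + 1e-12)
    cepstrum = np.real(np.fft.ifft(log_spec_db))
    # Cepstral bin k corresponds to a period of k / fs seconds.
    lo, hi = int(fs / f0_range[1]), int(fs / f0_range[0])
    quefs = np.arange(lo, hi)
    band = cepstrum[lo:hi]
    peak_idx = np.argmax(band)
    slope, intercept = np.polyfit(quefs, band, 1)   # baseline trend
    return band[peak_idx] - (slope * quefs[peak_idx] + intercept)

# A strongly periodic "voice" yields a much higher CPP than noise,
# which is why CPP tracks the regularity of vocal fold vibration.
fs = 16000
t = np.arange(fs) / fs
periodic = np.sum([np.sin(2 * np.pi * 200 * k * t) for k in range(1, 20)], axis=0)
noise = np.random.default_rng(0).standard_normal(fs)
print(cepstral_peak_prominence(periodic, fs), cepstral_peak_prominence(noise, fs))
```

Published CPP implementations differ in windowing, averaging, and the exact baseline fit, so absolute values from this sketch are not comparable to clinical norms.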