
    Modal Locking Between Vocal Fold Oscillations and Vocal Tract Acoustics

    During voiced speech, the vocal folds interact with the vocal tract acoustics. The resulting glottal source-resonator coupling has been observed using mathematical and physical models as well as in in vivo phonation. We propose a computational time-domain model of the full speech apparatus that contains a feedback mechanism from the vocal tract acoustics to the vocal fold oscillations. It is based on the numerical solution of ordinary and partial differential equations defined on vocal tract geometries obtained by magnetic resonance imaging. The model is used to simulate rising and falling pitch glides of [ɑ, i] in the fundamental frequency (fo) interval [145 Hz, 315 Hz]. The interval contains the first vocal tract resonance fR1 and the first formant F1 of [i], as well as the fractions fR1/5, fR1/4, and fR1/3 of the first resonance of [ɑ]. The glide simulations reveal a locking pattern in the fo trajectory approximately at fR1 of [i]. The resonance fractions of [ɑ] produce perturbations in the pressure signal at the lips but no locking. Peer reviewed.
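
As a purely illustrative sketch (not part of the published model), the snippet below shows one way a locking pattern could be flagged in a simulated fo trajectory: stretches where the glide flattens out within a small band around a resonance frequency. The trajectory, resonance value, band width, and slope threshold are assumed placeholders.

```python
import numpy as np

def find_locking(t, fo, f_res, band_hz=15.0, slope_thresh=20.0):
    """Flag stretches where a simulated fo glide flattens out near a resonance.

    t            : time axis in seconds
    fo           : fundamental frequency trajectory in Hz
    f_res        : resonance frequency (e.g. fR1) in Hz, assumed known
    band_hz      : how close fo must be to f_res to count
    slope_thresh : maximum |dfo/dt| in Hz/s still considered "flat"
    """
    slope = np.gradient(fo, t)              # local glide rate in Hz/s
    near = np.abs(fo - f_res) < band_hz     # within the resonance band
    flat = np.abs(slope) < slope_thresh     # trajectory has flattened
    return near & flat                      # boolean mask of "locked" samples

# Toy rising glide 145 -> 315 Hz with an artificial plateau inserted at 230 Hz
t = np.linspace(0.0, 2.0, 2000)
fo = 145.0 + (315.0 - 145.0) * t / 2.0
fo[(fo > 225.0) & (fo < 235.0)] = 230.0     # stand-in for a locking segment

mask = find_locking(t, fo, f_res=230.0)
print(f"locked for {mask.sum() * (t[1] - t[0]):.3f} s around 230 Hz")
```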

    Acoustic characterization of the glides /j/ and /w/ in American English

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from the student-submitted PDF version of the thesis. Includes bibliographical references (p. 141-145). Acoustic analyses were conducted to identify the characteristics that differentiate the glides /j,w/ from adjacent vowels. These analyses were performed on a recorded database of intervocalic glides, produced naturally by two male and two female speakers in controlled vocalic and prosodic contexts. Glides were found to differ significantly from adjacent vowels through RMS amplitude reduction, first formant frequency reduction, open quotient increase, harmonics-to-noise ratio reduction, and fundamental frequency reduction. The acoustic data suggest that glides differ from their cognate high vowels /i,u/ in that the glides are produced with a greater degree of constriction in the vocal tract. The narrower constriction causes an increase in oral pressure, which produces aerodynamic effects on the glottal voicing source. This interaction between the vocal tract filter and its excitation source results in skewing of the glottal waveform, increasing its open quotient and decreasing the amplitude of voicing. A listening experiment with synthetic tokens was performed to isolate and compare the perceptual salience of acoustic cues to the glottal source effects of glides and to the vocal tract configuration itself. Voicing amplitude (representing source effects) and first formant frequency (representing filter configuration) were manipulated in cooperating and conflicting patterns to create percepts of /V#V/ or /V#GV/ sequences, where Vs were high vowels and Gs were their cognate glides. In the responses of ten naïve subjects, voicing amplitude had a greater effect on the detection of glides than first formant frequency, suggesting that glottal source effects are more important to the distinction between glides and high vowels. The results of the acoustic and perceptual studies provide evidence for an articulatory-acoustic mapping defining the glide category. It is suggested that glides are differentiated from high vowels and fricatives by articulatory-acoustic boundaries related to the aerodynamic consequences of different degrees of vocal tract constriction. The supraglottal constriction target for glides is sufficiently narrow to produce a non-vocalic oral pressure drop, but not sufficiently narrow to produce a significant frication noise source. This mapping is consistent with the theory that articulator-free features are defined by aero-mechanical interactions. Implications for phonological classification systems and speech technology applications are discussed. by Elisabeth Hon Hunt. Ph.D.
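
For illustration only, the sketch below computes two of the measures reported above, RMS amplitude and a rough first formant estimate via autocorrelation-method LPC, on a synthetic frame; the frame length, LPC order, window, and bandwidth criterion are assumptions, not the thesis's actual analysis settings.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def rms_db(frame):
    """RMS amplitude of a frame in dB (relative to unit full scale)."""
    return 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + 1e-12)

def first_formant(frame, fs, order=12):
    """Rough F1 estimate via autocorrelation-method LPC (illustration only)."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])   # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))   # poles of the LPC filter
    roots = roots[np.imag(roots) > 0]               # keep one root per pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bands = -fs / np.pi * np.log(np.abs(roots))     # pole bandwidths in Hz
    cand = sorted(f for f, b in zip(freqs, bands) if f > 90.0 and b < 400.0)
    return cand[0] if cand else None

# Synthetic vowel-like frame standing in for real glide/vowel data
fs = 16000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t)
print("RMS (dB):", round(rms_db(frame), 1))
print("F1 estimate (Hz):", first_formant(frame, fs))
```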

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 from the strongly felt need to share know-how, objectives, and results between areas that had until then seemed quite distinct, such as bioengineering, medicine, and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and elderly. Over the years the initial issues have grown and spread into other areas of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze (Florence), Italy.

    Effects of an Artificially Lengthened Vocal Tract on Glottal Closed Quotient in Untrained Male Voices

    The use of hard-walled narrow tubes, often called resonance tubes, for the purpose of voice therapy and voice training has a historical precedent and some theoretical support, but the mechanism of any potential benefit from the application of this technique has remained poorly understood. Fifteen vocally untrained male participants produced a series of spoken /Ψ/ vowels at a modal pitch and constant loudness, followed by a minute of repeated phonation into a hard-walled glass tube at the same pitch and loudness targets. The tube parameters and tube phonation task criteria were selected according to theoretical calculations predicting an increase in the acoustic load such that phonation would occur under conditions of near-maximum inertive reactance. Following tube phonation, each participant repeated a similar series of spoken /Ψ/ vowels. Electroglottography (EGG) was used to measure the glottal closed quotient (CQ) during each phase of the experiment. A single-subject, multiple-baseline design with direct replication across subjects was used to identify any changes in CQ across the phases of the experiment. Single-subject analysis using the method of Statistical Process Control (SPC) revealed statistically significant changes in CQ during tube phonation, but with no discernible pattern across the 15 participants. These results indicate that the use of resonance tubes can have a distinct effect on glottal closure, but the mechanism behind this change remains unclear. The implication is that vocal loading techniques such as this need to be studied further with specific attention paid to the underlying mechanism of any measured changes in glottal behavior, and especially to the role of instruction and feedback in the therapeutic and pedagogical application of these techniques.
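
The abstract does not specify how CQ was derived from the EGG signal; the sketch below illustrates one common criterion-level approach, in which the contact phase is taken to be the portion of each cycle where the EGG amplitude exceeds a fixed fraction of that cycle's peak-to-peak range. The 35% threshold, the zero-crossing cycle segmentation, and the synthetic waveform are assumptions for illustration only.

```python
import numpy as np

def closed_quotient(egg, threshold=0.35):
    """Estimate the glottal closed quotient (CQ) from an EGG signal.

    Criterion-level method: within each cycle, the contact ("closed") phase is
    taken as the portion where the EGG amplitude exceeds `threshold` of that
    cycle's peak-to-peak range. Cycles are segmented at upward zero crossings,
    a simplification of the usual DEGG-based cycle detection.
    """
    egg = egg - np.mean(egg)                       # remove DC drift
    zc = np.where((egg[:-1] < 0) & (egg[1:] >= 0))[0]
    cq = []
    for start, stop in zip(zc[:-1], zc[1:]):
        cycle = egg[start:stop]
        if len(cycle) < 8:                         # skip spurious short cycles
            continue
        level = cycle.min() + threshold * (cycle.max() - cycle.min())
        cq.append(np.mean(cycle > level))          # fraction of cycle "closed"
    return float(np.mean(cq)) if cq else float("nan")

# Asymmetric synthetic waveform standing in for a recorded EGG signal at 120 Hz
fs = 44100
t = np.arange(int(0.5 * fs)) / fs
egg = np.clip(np.sin(2 * np.pi * 120 * t), -0.3, 1.0)
print("estimated CQ:", round(closed_quotient(egg), 3))
```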

    Acoustic analysis of Sindhi speech - a precursor for an ASR system

    The functional and formative properties of speech sounds are usually referred to as acoustic-phonetics in linguistics. This research aims to demonstrate the acoustic-phonetic features of the elemental sounds of Sindhi, a branch of the Indo-European family of languages mainly spoken in the Sindh province of Pakistan and in some parts of India. In addition to the available articulatory-phonetic knowledge, acoustic-phonetic knowledge has been classified for the identification and classification of Sindhi language sounds. Determining the acoustic features of the language sounds helps to bring together sounds with similar acoustic characteristics under one natural class of meaningful phonemes. The obtained acoustic features and corresponding statistical results for a particular natural class of phonemes provide a clear understanding of the meaningful phonemes of Sindhi and also help to eliminate redundant sounds present in the inventory. At present Sindhi includes nine redundant, three interchanging, three substituting, and three confused pairs of consonant sounds. Some of the unique acoustic-phonetic features of Sindhi highlighted in this study are the acoustic features of its large number of contrastive voiced implosives and the acoustic impact of the language's flexibility in terms of the insertion and digestion of short vowels in the utterance. In addition, the presence of the affricate class of sounds and of diphthongs in Sindhi is addressed. The compilation of the meaningful language phoneme set by learning their acoustic-phonetic features serves one of the major goals of this study, because twelve sounds of Sindhi are studied that are not yet part of the language alphabet. The main acoustic features learned for the phonological structures of Sindhi are the fundamental frequency, the formants, and the duration, along with analysis of the obtained acoustic waveforms, formant tracks, and computer-generated spectrograms. The impetus for this research comes from the fact that detailed knowledge of the sound characteristics of the language elements has a broad variety of applications, from developing accurate synthetic speech production systems to modeling robust speaker-independent speech recognizers. The major research achievements and contributions of this study include: the compilation and classification of the elemental sounds of Sindhi; comprehensive measurement of the acoustic features of the language sounds, suitable for incorporation into the design of a Sindhi ASR system; an understanding of the dialect-specific acoustic variation of the elemental sounds of Sindhi; a speech database comprising voice samples of native Sindhi speakers; identification of the language's redundant, substituting, and interchanging pairs of sounds; and identification of the sounds that can potentially lead to segmentation and recognition errors in a Sindhi ASR system. These achievements create the fundamental building blocks for future work to design a state-of-the-art prototype: a gender- and environment-independent, continuous, conversational ASR system for Sindhi.
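
As a hedged illustration of the kind of measurements described (fundamental frequency and duration), the snippet below estimates f0 by autocorrelation and segment duration from sample counts; the synthetic "vowel", frame length, and search range are placeholders rather than the study's actual procedure.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency of a voiced frame by autocorrelation.

    A deliberately simple stand-in for the pitch analysis used in the study;
    the lag search range corresponds to fmin..fmax Hz.
    """
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])                # strongest periodicity
    return fs / lag

def segment_duration_ms(segment, fs):
    """Duration of a labelled segment in milliseconds."""
    return 1000.0 * len(segment) / fs

# Synthetic 150 Hz voiced segment standing in for a Sindhi vowel token
fs = 16000
t = np.arange(int(0.08 * fs)) / fs
vowel = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 450 * t)
print("f0 (Hz):", round(estimate_f0(vowel, fs), 1))
print("duration (ms):", segment_duration_ms(vowel, fs))
```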

    Vowel recognition in continuous speech

    In continuous speech, the identification of phonemes requires the ability to extract features that are capable of characterizing the acoustic signal. Previous work has shown that relatively high classification accuracy can be obtained from a single spectrum taken during the steady-state portion of the phoneme, assuming that the phonetic environment is held constant. The present study represents an attempt to extend this work to variable phonetic contexts by using dynamic rather than static spectral information. This thesis has four aims: 1) Classify vowels in continuous speech; 2) Find the optimal set of features that best describe the vowel regions; 3) Compare the classification results using a multivariate maximum likelihood distance measure with those of a neural network using the backpropagation model; 4) Examine the classification performance of a Hidden Markov Model given a pathway through phonetic space
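
A minimal sketch of the "multivariate maximum likelihood distance measure" idea, assuming per-class Gaussian models over formant-like feature vectors; the classes, features, and data below are synthetic placeholders, not the thesis's vowel data.

```python
import numpy as np

class GaussianMLClassifier:
    """Per-class multivariate Gaussian maximum-likelihood classifier.

    Each vowel class is modelled by its mean vector and covariance matrix,
    and a feature vector is assigned to the class with the highest
    log-likelihood (equivalently, the smallest Mahalanobis-type distance
    adjusted by the log-determinant of the covariance).
    """
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.params_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            mean = Xc.mean(axis=0)
            cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            self.params_[c] = (mean, np.linalg.inv(cov), np.linalg.slogdet(cov)[1])
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            mean, icov, logdet = self.params_[c]
            d = X - mean
            maha = np.einsum("ij,jk,ik->i", d, icov, d)   # squared Mahalanobis distance
            scores.append(-0.5 * (maha + logdet))         # log-likelihood up to a constant
        return self.classes_[np.argmax(scores, axis=0)]

# Toy example: two "vowel classes" in a 2-D formant-like feature space (F1, F2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([300, 2300], 60, (50, 2)),   # /i/-like tokens
               rng.normal([700, 1200], 60, (50, 2))])  # /a/-like tokens
y = np.array(["i"] * 50 + ["a"] * 50)
clf = GaussianMLClassifier().fit(X, y)
print(clf.predict(np.array([[320, 2250], [680, 1150]])))
```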

    Investigating the build-up of precedence effect using reflection masking

    The auditory processing level involved in the build-up of precedence [Freyman et al., J. Acoust. Soc. Am. 90, 874-884 (1991)] has been investigated here by employing reflection masked threshold (RMT) techniques. Given that RMT techniques are generally assumed to address lower levels of auditory signal processing, such an approach represents a bottom-up approach to the build-up of precedence. Three conditioner configurations measuring a possible build-up of reflection suppression were compared to the baseline RMT for four reflection delays ranging from 2.5–15 ms. No build-up of reflection suppression was observed for any of the conditioner configurations. Build-up of template (a decrease in RMT for two of the conditioners), on the other hand, was found to be delay dependent. For five of six listeners, with reflection delays of 2.5 and 15 ms, RMT decreased relative to the baseline. For 5- and 10-ms delays, no change in threshold was observed. It is concluded that the low-level auditory processing involved in RMT is not sufficient to realize a build-up of reflection suppression. This confirms suggestions that higher-level processing is involved in precedence-effect build-up. The observed enhancement of reflection detection (RMT) may contribute to active suppression at higher processing levels.
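
For illustration, the sketch below builds a simple direct-plus-reflection stimulus at the delays mentioned above (2.5–15 ms); the burst duration, reflection gain, and sampling rate are assumptions and not the actual RMT stimuli used in the study.

```python
import numpy as np

def add_reflection(direct, fs, delay_ms, gain_db=-6.0):
    """Return the direct signal plus one delayed, attenuated copy (a 'reflection')."""
    delay_samples = int(round(delay_ms * 1e-3 * fs))
    gain = 10.0 ** (gain_db / 20.0)
    out = np.zeros(len(direct) + delay_samples)
    out[:len(direct)] += direct                       # direct sound
    out[delay_samples:] += gain * direct              # single reflection
    return out

# 1-ms noise burst as a stand-in for the lead stimulus
fs = 48000
burst = np.random.default_rng(1).standard_normal(int(0.001 * fs))
for delay in (2.5, 5.0, 10.0, 15.0):                  # delays covered in the study
    stim = add_reflection(burst, fs, delay_ms=delay)
    print(f"delay {delay:4.1f} ms -> stimulus length {len(stim)} samples")
```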

    Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

    This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the 'best' segments available to synthesise the new utterance are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution for performing these modifications, although some perceptible distortion, in the form of 'buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in this thesis was developed to reduce the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co-occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion. The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than being phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processing distortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes, trained using the experimental data collected during the listening tests. The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA-balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process.
    The listening test further indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes. The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence level were smaller than those seen in the earlier investigative experiments that made use of word-level stimuli. This suggests that the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community.
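
As a rough, hedged sketch of the core TD-PSOLA operation described above (not the thesis's implementation), the code below extracts two-period Hann-windowed frames at given pitch marks and overlap-adds them at rescaled spacing to change the pitch; epoch detection is assumed solved, and the pitch marks are taken as given.

```python
import numpy as np

def td_psola(signal, epochs, pitch_factor):
    """Minimal TD-PSOLA-style pitch modification (illustration only).

    signal       : mono speech samples
    epochs       : indices of pitch marks (glottal closure instants), assumed known
    pitch_factor : >1 raises the pitch, <1 lowers it; the time scale is preserved
    """
    epochs = np.asarray(epochs)
    periods = np.diff(epochs)
    out = np.zeros(len(signal) + int(periods.max()) + 1)

    # Synthesis pitch marks: local analysis period divided by the pitch factor
    syn_marks, t = [], float(epochs[0])
    while t < epochs[-1]:
        syn_marks.append(int(round(t)))
        i = np.searchsorted(epochs, t)
        T = periods[min(max(i - 1, 0), len(periods) - 1)]
        t += T / pitch_factor

    for ts in syn_marks:
        i = int(np.argmin(np.abs(epochs - ts)))     # nearest analysis epoch
        T = periods[min(max(i - 1, 0), len(periods) - 1)]
        start, stop = epochs[i] - T, epochs[i] + T
        if start < 0 or stop > len(signal) or ts - T < 0:
            continue
        frame = signal[start:stop] * np.hanning(stop - start)
        out[ts - T: ts + T] += frame                # overlap-add at the new mark
    return out[: len(signal)]

# Toy example: a 100 Hz pulse train with known pitch marks, pitch raised ~25%
fs = 16000
T0 = fs // 100
n = 40 * T0
marks = np.arange(T0, n - T0, T0)
pulses = np.zeros(n)
pulses[marks] = 1.0
voiced = np.convolve(pulses, np.hanning(40), mode="same")
shifted = td_psola(voiced, marks, pitch_factor=1.25)
print("output samples:", len(shifted))
```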