
    Tongue Movements in Feeding and Speech

    The position of the tongue relative to the upper and lower jaws is regulated in part by the position of the hyoid bone, which, with the anterior and posterior suprahyoid muscles, controls the angulation and length of the floor of the mouth on which the tongue body 'rides'. The instantaneous shape of the tongue is controlled by the 'extrinsic' muscles acting in concert with the 'intrinsic' muscles. Recent anatomical research in non-human mammals has shown that the intrinsic muscles can best be regarded as a 'laminated segmental system' with tightly packed layers of 'transverse', 'longitudinal', and 'vertical' muscle fibers. Each segment receives separate innervation from branches of the hypoglossal nerve. These new anatomical findings are contributing to the development of functional models of the tongue, many based on increasingly refined finite element modeling techniques. They also begin to explain the observed behavior of the jaw-hyoid-tongue complex, or the hyomandibular 'kinetic chain', in feeding and consecutive speech. Similarly, major efforts, involving many imaging techniques (cinefluorography, ultrasound, electropalatography, NMRI, and others), have examined the spatial and temporal relationships of the tongue surface in sound production. The feeding literature shows localized tongue-surface change as the process progresses. The speech literature shows extensive change in tongue shape between classes of vowels and consonants. Although there is a fundamental dichotomy between the referential framework and the methodological approach to studies of the orofacial complex in feeding and speech, it is clear that many of the shapes adopted by the tongue in speaking are seen in feeding. It is suggested that the range of shapes used in feeding is the matrix for both behaviors.

    Segmentation of tongue shapes during vowel production in magnetic resonance images based on statistical modelling

    Quantification of the anatomical and functional aspects of the tongue is pertinent to the analysis of the mechanisms involved in speech production. Speech requires dynamic and complex articulation of the vocal tract organs, and the tongue is one of the main articulators during speech production. Magnetic resonance imaging has been widely used in speech-related studies, and segmentation of such images of the speech organs is required to extract reliable statistical data. However, standard solutions for analysing large sets of articulatory images have not yet been established. Therefore, this article presents an approach to segment the tongue in two-dimensional magnetic resonance images and to statistically model the segmented tongue shapes. The proposed approach assesses articulator morphology based on an active shape model, which captures the shape variability of the tongue during speech production. To validate this new approach, a dataset of mid-sagittal magnetic resonance images acquired from four subjects was used, and key aspects of the shape of the tongue during the vocal production of relevant European Portuguese vowels were evaluated.
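    As an illustration of the statistical shape modelling step, the sketch below builds a point-distribution model over aligned tongue contours by principal component analysis and reconstructs contours from a small number of shape parameters. It is a minimal sketch only: the landmark count, variance threshold, and the random stand-in data are assumptions, not details taken from the article.

```python
# Minimal sketch of an active-shape-model style statistical shape model,
# assuming each segmented tongue contour has been aligned and resampled to a
# fixed number of (x, y) landmarks; the training data below are hypothetical.
import numpy as np

def build_shape_model(shapes, variance_kept=0.95):
    """shapes: (n_samples, n_landmarks, 2) array of aligned contours."""
    X = shapes.reshape(len(shapes), -1)              # flatten to shape vectors
    mean_shape = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean_shape, full_matrices=False)
    var = s ** 2 / (len(shapes) - 1)                 # variance explained per mode
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), variance_kept)) + 1
    return mean_shape, Vt[:k], var[:k]               # mean, modes, eigenvalues

def synthesize_shape(mean_shape, modes, b):
    """Reconstruct a contour from the shape parameter vector b."""
    return (mean_shape + b @ modes).reshape(-1, 2)

# Example: random contours stand in for real mid-sagittal tongue outlines.
shapes = np.random.randn(40, 50, 2)                  # 40 contours, 50 landmarks each
mean, modes, var = build_shape_model(shapes)
first_mode_shape = synthesize_shape(
    mean, modes, np.concatenate([[np.sqrt(var[0])], np.zeros(len(var) - 1)]))
```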

    Vocal tract acoustic measurements and their application to articulatory modelling

    In the field of speech research it is agreed that more real data are required to improve the articulatory modelling of the vocal tract. Acoustic techniques may be used to acquire vocal tract data, and the advance of digital signal processing has allowed the development of new experimental techniques that allow fast and efficient measurements. DSP-based measurement systems were set up, and acoustic impedance and transfer function measurements were performed on a wide variety of subjects in DCU’s semi-anechoic chamber. The measurement systems are compact and reproducible. The variation of the wall vibration load was investigated in a wide range of human subjects, prompted by the question: is the wall vibration load important in the study and implementation of vocal tract and articulatory models? The results point to the possible need, in acoustic-to-articulatory inversion, to adapt the reference model to specific subjects by separately estimating the wall impedance load.
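    For concreteness, the following sketch shows one common DSP route to the kind of transfer function estimate such measurement systems rely on: the H1 estimator formed from averaged cross- and auto-spectra of the excitation and response signals. The sample rate, segment length, and the synthetic one-sample-delay test "system" are assumptions for illustration, not the parameters used in the thesis.

```python
# Sketch of an H1 transfer-function estimate between an excitation signal and
# the measured response, as used in DSP-based acoustic measurement systems.
import numpy as np
from scipy.signal import csd, welch

def transfer_function(excitation, response, fs, nperseg=4096):
    """H1 estimator: averaged cross-spectrum over the excitation auto-spectrum."""
    f, Pxy = csd(excitation, response, fs=fs, nperseg=nperseg)
    _, Pxx = welch(excitation, fs=fs, nperseg=nperseg)
    return f, Pxy / Pxx

fs = 48_000
x = np.random.randn(4 * fs)                            # broadband excitation
y = np.roll(x, 1) + 0.01 * np.random.randn(4 * fs)     # delayed, slightly noisy response
f, H = transfer_function(x, y, fs)
# |H| should stay close to 1 across the band, with a linear phase ramp from the delay.
```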

    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The proceedings of the MAVEBA Workshop, held every two years, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images as a support to the clinical diagnosis and classification of vocal pathologies. The Workshop is sponsored by Ente Cassa di Risparmio di Firenze, COST Action 2103, the Biomedical Signal Processing and Control journal (Elsevier), and the IEEE Biomedical Engineering Soc. Special issues of international journals have been, and will be, published collecting selected papers from the conference.

    Let the agents do the talking: On the influence of vocal tract anatomy on speech during ontogeny


    Diphthong Synthesis using the Three-Dimensional Dynamic Digital Waveguide Mesh

    The human voice is a complex and nuanced instrument, and despite many years of research, no system is yet capable of producing natural-sounding synthetic speech. This affects intelligibility for some groups of listeners in applications such as automated announcements and screen readers. Furthermore, those who require a computer to speak, due to surgery or a degenerative disease, are limited to unnatural-sounding voices that lack expressive control and may not match the user's gender, age or accent. It is evident that natural, personalised and controllable synthetic speech systems are required. A three-dimensional digital waveguide model of the vocal tract, based on magnetic resonance imaging data, is proposed here in order to address these issues. The model uses a heterogeneous digital waveguide mesh method to represent the vocal tract airway and surrounding tissues, facilitating dynamic movement and hence speech output. The accuracy of the method is validated by comparison with audio recordings of natural speech, and perceptual tests are performed which confirm that the proposed model sounds significantly more natural than simpler digital waveguide mesh vocal tract models. Control of such a model is also considered, and a proof-of-concept study is presented using a deep neural network to control the parameters of a two-dimensional vocal tract model, resulting in intelligible speech output and paving the way for extension of the control system to the proposed three-dimensional vocal tract model. Future improvements to the system are also discussed in detail. This project considers both the naturalness and control issues associated with synthetic speech and therefore represents a significant step towards improved synthetic speech for use across society.
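    To make the core mechanism concrete, the sketch below implements one scattering-and-propagation update of a 2-D rectilinear digital waveguide mesh; the heterogeneous 3-D mesh described in the thesis generalizes the same idea to six ports per junction with spatially varying admittances. The grid size, boundary handling, and impulse excitation are simplifying assumptions, not details taken from the thesis.

```python
# One time step of a lossless 2-D rectilinear digital waveguide mesh.
# Port convention per junction: 0 = from west, 1 = from east, 2 = from south, 3 = from north.
import numpy as np

NX, NY = 64, 32
p_in = np.zeros((4, NX, NY))          # incoming wave variables at every junction

def dwm_step(p_in):
    """Scatter at each 4-port junction, then propagate to neighbours (unit delay)."""
    p_j = 0.5 * p_in.sum(axis=0)      # junction pressure: (2 / N) * sum of inputs, N = 4
    p_out = p_j[None, :, :] - p_in    # outgoing wave on each port
    nxt = np.zeros_like(p_in)
    nxt[0, 1:, :] = p_out[1, :-1, :]  # east-going waves arrive on the west port
    nxt[1, :-1, :] = p_out[0, 1:, :]  # west-going waves arrive on the east port
    nxt[2, :, 1:] = p_out[3, :, :-1]  # north-going waves arrive on the south port
    nxt[3, :, :-1] = p_out[2, :, 1:]  # south-going waves arrive on the north port
    return nxt, p_j                   # edges left at zero act as crude absorbing boundaries

p_in[:, 5, 16] = 1.0                  # impulse excitation at one junction
for _ in range(200):
    p_in, pressure = dwm_step(p_in)   # `pressure` could be sampled at an output junction
```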

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances in terms of both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm, and consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of the articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features. The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and made the estimated articulatory trajectories more consistent with VTL's articulatory preferences, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained on German data could be generalized to utterances of other languages.
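    As a concrete illustration of the analysis-by-synthesis scoring that the genetic algorithm relies on, the sketch below computes a frame-wise cosine distance between the acoustic features of a natural and a synthetic utterance. The feature choice (MFCCs via librosa), the sample rate, and the `synthesize` callable (standing in for a VocalTractLab wrapper) are assumptions for illustration, not the exact pipeline of the thesis.

```python
# Sketch of an analysis-by-synthesis fitness: mean cosine distance between the
# acoustic features of a natural utterance and of the utterance synthesized from
# a candidate gestural score. `synthesize` is any callable mapping a gestural
# score to a waveform (e.g. a wrapper around VocalTractLab); MFCCs stand in for
# the acoustic features used in the thesis.
import numpy as np
import librosa

def mfcc_features(signal, sr, n_mfcc=13):
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T   # (frames, coeffs)

def acs_fitness(natural_signal, gestural_score, synthesize, sr=22050):
    synthetic_signal = synthesize(gestural_score)
    ref = mfcc_features(natural_signal, sr)
    syn = mfcc_features(synthetic_signal, sr)
    n = min(len(ref), len(syn))                  # crude frame alignment by truncation
    ref, syn = ref[:n], syn[:n]
    cos = np.sum(ref * syn, axis=1) / (
        np.linalg.norm(ref, axis=1) * np.linalg.norm(syn, axis=1) + 1e-9)
    return float(np.mean(1.0 - cos))             # lower is better; the GA minimizes this
```

    A genetic algorithm would evaluate such a fitness for every candidate gestural score in the population and carry the lowest-scoring candidates forward to the next generation, which is the role the ABS procedure plays in the first method above.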