3 research outputs found

    Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated

    No full text
    Articulatory speech synthesis requires generating realistic vocal tract shapes from the sequence of phonemes to be articulated. This work proposes the first model trained on rt-MRI films to automatically predict the contours of all vocal tract articulators. The data are the contours tracked in an rt-MRI database recorded for one speaker. These contours were used to train an encoder-decoder network that maps the sequence of phonemes and their durations to the exact gestures performed by the speaker. Unlike other works, each articulator contour is predicted separately, allowing their interactions to be investigated. We measure four tract variables closely coupled with critical articulators and observe their variations over time. Testing demonstrates that our model can produce high-quality shapes of the complete vocal tract, with a good correlation between the predicted and target variables observed in rt-MRI films, even though the tract variables are not included in the optimization procedure.
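The abstract does not define its four tract variables or its correlation measure; as an illustration only, a lip-aperture tract variable (here assumed to be the minimum distance between the two lip contours, a hypothetical definition) and a Pearson correlation between predicted and target trajectories can be sketched in plain Python:

```python
from math import dist

def lip_aperture(upper_lip, lower_lip):
    # Hypothetical tract variable: minimum Euclidean distance between
    # any point on the upper-lip contour and any point on the lower-lip
    # contour (the paper's exact definition is not given in the abstract).
    return min(dist(p, q) for p in upper_lip for q in lower_lip)

def pearson(x, y):
    # Pearson correlation between two equal-length trajectories, e.g. a
    # predicted and a target tract-variable time series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Toy contours (coordinates in mm) standing in for tracked lip contours.
upper = [(0.0, 10.0), (1.0, 9.5), (2.0, 9.0)]
lower = [(0.0, 4.0), (1.0, 4.5), (2.0, 5.0)]
print(lip_aperture(upper, lower))            # 4.0
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
```

Measuring such variables on both the predicted and the ground-truth contours, frame by frame, is one way to assess articulatory quality without including the variables in the training loss.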

    Modeling the temporal evolution of the vocal tract shape with deep learning

    No full text
    This paper overviews our work on the links between coarticulation modeling, approached from the point of view of predicting the vocal tract shape from the phonetic sequence, and the available real-time MRI corpora. Real-time MRI has revolutionized the acquisition of articulatory data through its image quality, the possibility of acquiring and denoising the speech signal, and the possibility of recording corpora containing several thousand sentences. Coarticulation modeling is only possible with the ability to reliably track articulator contours in many images. Tracking techniques using neural networks have provided efficient solutions comparable in reliability to human annotation. Finally, we show that even if recurrent neural networks trained on these corpora can successfully predict the shape of the vocal tract, it is still necessary to impose constraints drawn directly from phonetics to ensure consistency in the prediction.

    Automatic segmentation of vocal tract articulators in real-time magnetic resonance imaging

    No full text
    Background and Objectives: The characterization of vocal tract geometry during speech is of interest to various research topics, including speech production modeling, motor control analysis, and speech therapy design. Real-time MRI is a reliable and non-invasive tool for this purpose. In most cases, it is necessary to know the contours of the individual articulators from the glottis to the lips. Several techniques have been proposed for segmenting vocal tract articulators, but most are limited to specific applications. Moreover, they often do not provide individualized contours for all soft-tissue articulators in a multi-speaker configuration. Methods: A Mask R-CNN network was trained to detect and segment the vocal tract articulator contours in two real-time MRI (RT-MRI) datasets with speech recordings of multiple speakers. Two post-processing algorithms were then proposed to convert the network's outputs into geometrical curves. Nine articulators were considered: the two lips, tongue, soft palate, pharynx, arytenoid cartilage, epiglottis, thyroid cartilage, and vocal folds. A leave-one-out cross-validation protocol was used to evaluate inter-speaker generalization. The evaluation metrics were the point-to-closest-point (P2CP) distance and the Jaccard index (for articulators annotated as closed contours). Results: The proposed method accurately segmented the vocal tract articulators, with an average root mean square P2CP distance of less than 2.2 mm for all articulators in the leave-one-out cross-validation setting. The minimum P2CP RMS was 0.91 mm for the upper lip, and the maximum was 2.18 mm for the tongue. The Jaccard indices for the thyroid cartilage and vocal folds were 0.60 and 0.61, respectively. Additionally, the method adapted to a new subject with only ten annotated samples.
    Conclusions: Our research introduced a method for individually segmenting nine non-rigid vocal tract articulators in real-time MRI movies. The software is openly available to the speech community as an installable package. It is designed to support speech applications and clinical and non-clinical research in fields that require vocal tract geometry, such as speech, singing, and human beatboxing. (Preprint submitted to Computer Methods and Programs in Biomedicine, January 7, 2024.)
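The two evaluation metrics named above can be sketched in plain Python. This is a minimal one-directional reading of the point-to-closest-point (P2CP) RMS distance (the paper's exact symmetrization, if any, is not stated in the abstract) and a Jaccard index on sets of occupied pixels for closed-contour articulators:

```python
from math import dist

def p2cp_rms(pred, target):
    # For each predicted contour point, take the distance to the closest
    # target contour point, then return the root mean square over points.
    # One-directional sketch; an assumption about the paper's definition.
    d2 = [min(dist(p, q) for q in target) ** 2 for p in pred]
    return (sum(d2) / len(d2)) ** 0.5

def jaccard(mask_a, mask_b):
    # Jaccard index |A ∩ B| / |A ∪ B| on sets of occupied pixels, as used
    # for articulators annotated as closed contours (e.g. thyroid
    # cartilage, vocal folds).
    a, b = set(mask_a), set(mask_b)
    return len(a & b) / len(a | b)

# Toy data: a predicted contour offset 1 mm from the target contour.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
target = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(p2cp_rms(pred, target))  # 1.0

# Toy pixel masks for a closed-contour articulator.
a = {(0, 0), (0, 1), (1, 0), (1, 1)}
b = {(0, 1), (1, 1), (1, 2)}
print(jaccard(a, b))  # 0.4
```

Note that P2CP is deliberately looser than a point-to-point distance: it does not penalize a correct curve whose points are merely parameterized differently from the annotation.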