
    Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI

    Vocal tract configurations play a vital role in generating distinguishable speech sounds by modulating the airflow and creating different resonant cavities during speech production. They contain abundant information that can be used to better understand the underlying speech production mechanism. As a step towards automatic mapping of vocal tract shape geometry to acoustics, this paper employs effective video action recognition techniques, such as Long-term Recurrent Convolutional Network (LRCN) models, to identify different vowel-consonant-vowel (VCV) sequences from the dynamic shaping of the vocal tract. Such a model typically combines a CNN-based deep hierarchical visual feature extractor with recurrent networks, which makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performance of this class of algorithms under various parameter settings and for various classification tasks is discussed. Interestingly, the results show a marked difference in model performance for speech classification compared with generic sequence or video classification tasks. Comment: To appear in the INTERSPEECH 2018 Proceedings.
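    As a rough illustration of the LRCN idea described above (a per-frame CNN feature extractor feeding a recurrent network that classifies a clip), here is a minimal PyTorch sketch. The layer sizes, hidden dimension, frame count, and number of VCV classes are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal LRCN-style sketch: per-frame CNN features -> LSTM -> clip label.
# Hyperparameters (feature size, hidden size, number of VCV classes) are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class LRCNSketch(nn.Module):
    def __init__(self, num_classes=51, feat_dim=256, hidden_dim=128):
        super().__init__()
        # Small CNN applied independently to each grayscale MRI frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # LSTM models the temporal dynamics of the per-frame features.
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):            # clips: (batch, time, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.rnn(feats)    # final hidden state summarizes the clip
        return self.head(h_n[-1])

# Example: a batch of 2 clips, 20 frames each, 68x68 pixels (shapes are hypothetical).
logits = LRCNSketch()(torch.randn(2, 20, 1, 68, 68))
print(logits.shape)  # torch.Size([2, 51])
```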

    Analyzing speech in both time and space: generalized additive mixed models can uncover systematic patterns of variation in vocal tract shape in real-time MRI

    We present a method of using generalized additive mixed models (GAMMs) to analyze midsagittal vocal tract data obtained from real-time magnetic resonance imaging (rt-MRI) video of speech production. Applied to rt-MRI data, GAMMs allow for observation of factor effects on vocal tract shape throughout two key dimensions: time (vocal tract change over the temporal course of a speech segment) and space (location of change within the vocal tract). Examples of this method are provided for rt-MRI data collected at a temporal resolution of 20 ms and a spatial resolution of 1.41 mm, for 36 native speakers of German. The rt-MRI data were quantified as 28-point semi-polar-grid aperture functions. Three test cases are provided as a way of observing vocal tract differences between: (1) /aː/ and /iː/, (2) /aː/ and /aɪ/, and (3) accentuated and unstressed /aː/. The results for each GAMM are independently validated using functional linear mixed models (FLMMs) constructed from data obtained at 20% and 80% of the vowel interval. In each case, the two methods yield similar results. In light of the method similarities, we propose that GAMMs are a robust, powerful, and interpretable method of simultaneously analyzing both temporal and spatial effects in rt-MRI video of speech.
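    To make the time-by-space idea concrete, the sketch below fits a smooth surface of aperture over normalized time and semi-polar gridline position using the third-party pygam package. The synthetic data, column layout, and the omission of speaker random effects are simplifying assumptions for illustration; this is not the authors' full GAMM specification.

```python
# Sketch: a smooth of vocal tract aperture over (normalized time, gridline index),
# in the spirit of the GAMM analysis described above. Synthetic data; no speaker
# random effects, unlike a full GAMM.
import numpy as np
from pygam import LinearGAM, s, te

rng = np.random.default_rng(0)
n = 2000
time = rng.uniform(0, 1, n)         # normalized position in the vowel interval
gridline = rng.integers(0, 28, n)   # 28-point semi-polar grid index
aperture = (np.sin(np.pi * time) * np.cos(gridline / 28 * np.pi)
            + rng.normal(0, 0.1, n))                     # synthetic aperture values

X = np.column_stack([time, gridline])
# Main smooths for time and space plus a tensor-product interaction, letting the
# effect on tract shape vary jointly over both dimensions.
gam = LinearGAM(s(0) + s(1) + te(0, 1)).fit(X, aperture)

# Predicted aperture profile across the grid at the vowel midpoint (time = 0.5).
profile = gam.predict(np.column_stack([np.full(28, 0.5), np.arange(28)]))
print(profile.round(2))
```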

    A multispeaker dataset of raw and reconstructed speech production real-time MRI video and 3D volumetric images

    Real-time magnetic resonance imaging (RT-MRI) of human speech production is enabling significant advances in speech science, linguistics, bio-inspired speech technology development, and clinical applications. Easy access to RT-MRI is, however, limited, and comprehensive datasets with broad access are needed to catalyze research across numerous domains. The imaging of the rapidly moving articulators and dynamic airway shaping during speech demands high spatio-temporal resolution and robust reconstruction methods. Further, while reconstructed images have been published, to date there is no open dataset providing raw multi-coil RT-MRI data from an optimized speech production experimental setup. Such datasets could enable new and improved methods for dynamic image reconstruction, artifact correction, feature extraction, and direct extraction of linguistically relevant biomarkers. The present dataset offers a unique corpus of 2D sagittal-view RT-MRI videos along with synchronized audio for 75 subjects performing linguistically motivated speech tasks, alongside the corresponding first-ever public-domain raw RT-MRI data. The dataset also includes 3D volumetric vocal tract MRI during sustained speech sounds and high-resolution static anatomical T2-weighted upper airway MRI for each subject. Comment: 27 pages, 6 figures, 5 tables, submitted to Nature Scientific Data.
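    For readers unfamiliar with what "raw multi-coil data" implies, the sketch below shows the most basic reconstruction step for fully sampled Cartesian k-space: a per-coil inverse FFT followed by root-sum-of-squares coil combination. The array shapes are hypothetical, and real RT-MRI acquisitions such as these typically use undersampled, non-Cartesian trajectories that require gridding and more sophisticated reconstruction, which is exactly the kind of research a raw dataset enables.

```python
# Minimal illustration of multi-coil MRI reconstruction: inverse FFT per coil,
# then root-sum-of-squares combination. Shapes are hypothetical; undersampled
# non-Cartesian RT-MRI data would instead need gridding/iterative reconstruction.
import numpy as np

def rss_reconstruct(kspace):
    """kspace: complex array (coils, ny, nx) -> magnitude image (ny, nx)."""
    coil_images = np.fft.fftshift(
        np.fft.ifft2(np.fft.ifftshift(kspace, axes=(-2, -1)), axes=(-2, -1)),
        axes=(-2, -1),
    )
    return np.sqrt((np.abs(coil_images) ** 2).sum(axis=0))

# Hypothetical usage with a synthetic 8-coil, 84x84 frame.
kspace = np.random.randn(8, 84, 84) + 1j * np.random.randn(8, 84, 84)
image = rss_reconstruct(kspace)
print(image.shape)  # (84, 84)
```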

    Registration and statistical analysis of the tongue shape during speech production

    This thesis analyzes the human tongue shape during speech production. First, a semi-supervised approach is derived for estimating the tongue shape from volumetric magnetic resonance imaging data of the human vocal tract. Results of this extraction are used to derive parametric tongue models. Next, a framework is presented for registering sparse motion capture data of the tongue by means of such a model. This method makes it possible to generate full three-dimensional animations of the tongue. Finally, a multimodal and statistical text-to-speech system is developed that is able to synthesize audio and synchronized tongue motion from text. Funded by the German Research Foundation.
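    As a loose illustration of the general idea of registering sparse measurements with a parametric shape model (not the thesis's actual method), the sketch below builds a PCA shape model from tongue contours and recovers a full contour from a few observed points by least-squares fitting of the model weights. All data and dimensions are synthetic.

```python
# Loose sketch of model-based registration of sparse tongue data: build a PCA
# shape model from training contours, then recover full shapes from a few observed
# points by solving for the model weights. Entirely synthetic; not the thesis's method.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_points, n_modes = 200, 50, 5

# Synthetic training contours (flattened 2D tongue outlines).
train = rng.normal(size=(n_train, n_points * 2))
mean = train.mean(axis=0)
# PCA via SVD of the centered training set; keep the first few modes.
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
modes = vt[:n_modes]                       # (n_modes, n_points * 2)

# Sparse observation: only 6 of the 50 contour points are measured.
observed_idx = np.array([2, 10, 18, 26, 34, 42])
cols = np.concatenate([observed_idx * 2, observed_idx * 2 + 1])
target = rng.normal(size=cols.size)        # hypothetical sensor coordinates

# Least-squares fit of mode weights to the sparse measurements,
# then reconstruct the full contour from the fitted weights.
weights, *_ = np.linalg.lstsq(modes[:, cols].T, target - mean[cols], rcond=None)
full_contour = (mean + weights @ modes).reshape(n_points, 2)
print(full_contour.shape)  # (50, 2)
```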

    High-resolution full-vocal-tract dynamic speech magnetic resonance imaging

    Dynamic magnetic resonance imaging (MRI) holds great promise for speech-related studies because of its potential to investigate velopharyngeal motion and physiological properties jointly in real time. However, many applications of dynamic speech MRI are limited by the technical trade-offs in imaging speed, spatial coverage, spatial resolution, and clinical interpretation. In particular, high-resolution dynamic speech MRI with full-vocal-tract coverage and phonetically meaningful interpretation remains a challenging goal for many speech researchers. This dissertation develops novel model-based dynamic speech MRI approaches to enable high-resolution, full-vocal-tract 3D dynamic speech MRI with quantitative characterization of the articulatory motion. Our approaches include technical developments in imaging models, data acquisition strategies, and image reconstruction methods: (a) high-spatiotemporal-resolution speech MRI from sparsely sampled data is achieved by employing a low-rank imaging model that leverages the spatiotemporal correlations in dynamic speech motion; (b) a self-navigated sampling strategy is developed and employed to acquire spatiotemporal data at high imaging speed, collecting high-nominal-frame-rate cone navigators and randomized Cartesian imaging data within a single TR; (c) quantitative interpretation of speech motion is enabled by introducing a deformation-based sparsity constraint that not only improves image reconstruction quality but also characterizes articulatory motion through a high-resolution deformation field; and (d) accurate assessment of subject-specific motion, as opposed to generic motion patterns, is realized by using a low-rank plus sparse imaging model jointly with a technique to construct a high-quality spatiotemporal atlas. Regional sparse modeling is further introduced to assist effective motion analysis in the regions of interest. Our approaches are evaluated through both simulations on numerical phantoms and in vivo validation experiments across multiple subject groups. Both simulation and experimental results allow visualization of articulatory dynamics with a frame rate of 166 frames per second, a spatial resolution of 2.2 mm x 2.2 mm x 5.0 mm, and a spatial coverage of 280 mm x 280 mm x 40 mm covering the entire upper vocal tract across 8 mid-sagittal slices. Deformation fields yielded by our approaches share the same spatiotemporal resolution and accurately characterize soft-tissue motion. With a high-quality atlas, the low-rank and the sparse components are reconstructed to reveal both subject-specific motion and generic speech motion across a specific subject group. The effectiveness of our approaches is demonstrated through practical phonetics investigations that include (a) integrative imaging and acoustics analysis of velopharyngeal closure; (b) understanding formation and variation across a variety of languages (American English, North Metropolitan French, Brazilian Portuguese, and Levantine Arabic); and (c) analyzing the motion variability of a particular subject with respect to a specific subject group. The capabilities of our method have the potential for precise assessment of oropharyngeal dynamics and comprehensive evaluation of speech motion.
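    The "low-rank imaging model" in (a) exploits the fact that a dynamic image series, reshaped into a space-by-time (Casorati) matrix, has rapidly decaying singular values, so a few spatial/temporal basis pairs capture most of the motion. The sketch below illustrates that idea with a truncated SVD on synthetic data; it is only conceptual, since the dissertation's methods estimate such models from sparsely sampled k-space rather than from the images themselves.

```python
# Conceptual illustration of the low-rank model behind (a): a dynamic image series
# reshaped into a space-by-time (Casorati) matrix is well approximated by a few
# spatial/temporal basis pairs. Synthetic data only; real methods estimate the
# factors from undersampled k-space, not from reconstructed images.
import numpy as np

rng = np.random.default_rng(0)
ny, nx, nt, rank = 64, 64, 100, 5

# Synthetic series: a few spatial patterns modulated by smooth temporal weights.
spatial = rng.normal(size=(ny * nx, rank))
temporal = np.cos(np.linspace(0, 4 * np.pi, nt)[:, None] * np.arange(1, rank + 1))
series = spatial @ temporal.T + 0.05 * rng.normal(size=(ny * nx, nt))

# Truncated SVD of the Casorati matrix recovers the dominant subspace.
u, s, vt = np.linalg.svd(series, full_matrices=False)
low_rank = u[:, :rank] * s[:rank] @ vt[:rank]

rel_err = np.linalg.norm(series - low_rank) / np.linalg.norm(series)
print(f"rank-{rank} approximation, relative error {rel_err:.3f}")
```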

    Measuring pre-speech articulation

    What do speakers do when they start to talk? This thesis concentrates on the articulatory aspects of this problem, and offers methodological and experimental results on tongue movement, captured using Ultrasound Tongue Imaging (UTI). Speech initiation occurs at the start of every utterance. An understanding of the timing relationship between articulatory initiation (which occurs first) and acoustic initiation (that is, the start of audible speech) has implications for speech production theories, the methodological design and interpretation of speech production experiments, and clinical studies of speech production. Two novel automated techniques, based on Euclidean distance, were developed for detecting articulatory onsets in UTI data. The methods are verified against manually annotated data. The latter technique is based on a novel way of identifying the region of the tongue that is first to initiate movement. Data from three speech production experiments are analysed in this thesis. The first experiment is a picture-naming task recorded with UTI; it is used to explore behavioural variation at the beginning of an utterance and to test and develop analysis tools for articulatory data. The second experiment also uses UTI recordings, but it is specifically designed to exclude any pre-speech movements of the articulators which are not directly related to the linguistic content of the utterance itself (that is, which are not expected to be present in every full repetition of the utterance), in order to study undisturbed speech initiation. The materials systematically varied the phonetic onsets of the monosyllabic target words and the vowel nucleus. They also provided an acoustic measure of the duration of the syllable rhyme. Statistical models analysed the timing relationships between the articulatory onset, the acoustic durations of the sound segments, and the acoustic duration of the rhyme. Finally, to test a discrepancy between the results of the second UTI experiment and findings in the literature based on data recorded with Electromagnetic Articulography (EMA), a third experiment measured a single speaker using both methods and matched materials. Using the global Pixel Difference and Scanline-based Pixel Difference analysis methods developed and verified in the first half of the thesis, the main experimental findings were as follows. First, pre-utterance silent articulation is timed in inverse correlation with the acoustic duration of the onset consonant and in positive correlation with the acoustic duration of the rhyme of the first word. Because of the latter correlation, it should be considered part of the first word. Second, comparison of UTI and EMA failed to replicate the discrepancy. Instead, EMA was found to produce longer reaction times independent of utterance type. Keywords: speech initiation, pre-speech articulation, delayed naming, ultrasound tongue imaging, electromagnetic articulography, automated methods.
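    The global pixel-difference idea mentioned above can be illustrated with a short sketch: compute the Euclidean distance between consecutive ultrasound frames and flag the first frame where that distance rises well above a resting baseline. The baseline window, threshold rule, and synthetic data are illustrative assumptions, not the thesis's validated procedure.

```python
# Illustrative sketch of a global pixel-difference onset detector: Euclidean
# distance between consecutive frames, with onset taken as the first frame whose
# difference exceeds the resting baseline by a margin. Baseline window and
# threshold rule are illustrative assumptions, not the thesis's exact procedure.
import numpy as np

def detect_onset(frames, baseline_frames=10, k=5.0):
    """frames: (n_frames, height, width) ultrasound video; returns onset frame index."""
    flat = frames.reshape(len(frames), -1).astype(float)
    # Euclidean distance between each pair of consecutive frames.
    diffs = np.linalg.norm(np.diff(flat, axis=0), axis=1)
    baseline = diffs[:baseline_frames]
    threshold = baseline.mean() + k * baseline.std()
    above = np.flatnonzero(diffs > threshold)
    return int(above[0]) + 1 if above.size else None

# Synthetic example: static noisy frames, then movement starting at frame 40.
rng = np.random.default_rng(0)
video = rng.normal(size=(80, 64, 64)) * 0.01
video[40:] += np.linspace(0.1, 1.0, 40)[:, None, None]  # gradual intensity change
print(detect_onset(video))  # 40
```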

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The Models and Analysis of Vocal Emissions with Biomedical Applications (MAVEBA) workshop came into being in 1999 out of a strongly felt need to share know-how, objectives, and results among areas that until then had seemed quite distinct, such as bioengineering, medicine, and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and elderly. Over the years the initial topics have grown and spread into other areas of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze, Italy.