Toward understanding speech planning by observing its execution – Representations, modeling and analysis

Abstract

This thesis proposes a balanced framework toward understanding speech motor planning and control by observing aspects of its behavioral execution. To this end, it represents, models, and analyzes real-time speech articulation data from both 'top-down' (knowledge-driven) and 'bottom-up' (data-driven) perspectives.

The first part of the thesis uses existing knowledge from linguistics and motor control to extract meaningful representations from real-time magnetic resonance imaging (rtMRI) data and, further, to posit and test specific hypotheses regarding kinematic and postural planning during pausing behavior. In the former case, we propose a measure to quantify the speed of articulators during pauses as well as in their immediate neighborhoods. Using appropriate statistical analysis techniques, we find support for the hypothesis that pauses at major syntactic boundaries (i.e., grammatical pauses), but not ungrammatical (e.g., word-search) pauses, are planned by a high-level cognitive mechanism that also controls the rate of articulation around these junctures. In the latter case, we present a novel automatic procedure to characterize vocal posture from rtMRI data. Statistical analyses suggest that articulatory settings differ during rest positions, ready positions, and inter-speech pauses, and might, in that order, involve an increasing degree of active control by the cognitive speech planning mechanism. We show that this may be because postures assumed during pauses are significantly more mechanically advantageous than postures assumed during absolute rest: inter-speech postures allow a larger change in the space of motor control tasks/goals for a minimal change in the articulatory posture space than postures at absolute rest do. We argue that such top-down approaches can be used to augment models of speech motor control.

The second part of the thesis presents a computational, data-driven approach to derive interpretable movement primitives from speech articulation data in a bottom-up manner. It puts forth a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to decompose a given data matrix into a set of spatiotemporal basis sequences and an activation matrix. The algorithm optimizes a cost function that trades off the mismatch between the proposed model and the input data against the number of primitives that are active at any given instant. The method is applied both to measured articulatory data obtained through electromagnetic articulography (EMA) and to synthetic data generated using an articulatory synthesizer. The thesis then describes how to evaluate the algorithm's performance quantitatively and further performs a qualitative assessment of its ability to recover compositional structure from data. The results suggest that the proposed algorithm extracts movement primitives from human speech production data that are linguistically interpretable. We further examine how well the derived representations of 'primitive movements' of speech articulation can be used to classify broad phone categories, and thus provide further insight into the link between speech production and perception. We finally show that such primitives can be mathematically modeled using nonlinear dynamical systems in a control-theoretic framework for speech motor control. Such a primitives-based framework could thus help inform practicable theories of speech motor control and coordination.
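As an illustrative sketch only (the symbols x, \theta, f, and J below are introduced here and are not necessarily the thesis's notation), the mechanical advantage of a posture can be framed in terms of the Jacobian of the forward map from articulatory posture variables to task/goal variables:

\[
x = f(\theta), \qquad \Delta x \approx J(\theta)\,\Delta\theta, \qquad
\mathrm{MA}(\theta) \;=\; \max_{\|\Delta\theta\| = 1} \big\| J(\theta)\,\Delta\theta \big\| \;=\; \sigma_{\max}\!\big(J(\theta)\big),
\]

so a posture with larger \mathrm{MA}(\theta) permits a larger change in task space for a minimal change in posture space, in the sense used above when comparing inter-speech and absolute-rest postures.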
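Similarly, a minimal sketch of the convolutive NMF decomposition with sparseness constraints described above, under assumed notation (V is the data matrix, the W(\tau) stack into spatiotemporal basis sequences, H is the activation matrix, T is the temporal extent of each basis, and S_h is a target sparseness level):

\[
V \;\approx\; \Lambda \;=\; \sum_{\tau=0}^{T-1} W(\tau)\, \overset{\tau \rightarrow}{H},
\qquad
\min_{W \ge 0,\; H \ge 0} \; \big\| V - \Lambda \big\|_F^2
\quad \text{subject to} \quad \mathrm{sparseness}(h_i) = S_h \;\; \text{for every row } h_i \text{ of } H,
\]

where \overset{\tau \rightarrow}{H} denotes H with its columns shifted \tau steps to the right; constraining the sparseness of each row of H is what limits how many primitives can be active at any given instant.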
