
    Articulatory imaging implicates prediction during spoken language comprehension

    It has been suggested that the activation of speech-motor areas during speech comprehension may, in part, reflect the involvement of the speech production system in synthesising upcoming material at an articulatorily specified level. In this study we explore that suggestion through the use of articulatory imaging. We investigate whether, and how, predictions that emerge during speech comprehension influence articulatory realisations during picture-naming. We elicited predictions by auditorily presenting high-cloze sentence-stems to participants (e.g., “When we want water we just turn on the 
”). Participants named a picture immediately following each sentence-stem presentation. Pictures either matched (e.g., TAP) or mismatched (e.g., CAP) the high-cloze sentence-stem target. Throughout each trial, participants’ speech-motor movements were recorded via dynamic ultrasound imaging. This allowed us to compare articulations in the match and mismatch conditions to each other and to a control condition (simple picture-naming). Articulations in the mismatch condition differed more from the control condition than did those in the match condition. This difference was reflected in a second analysis, which showed greater frame-by-frame change in articulator positions for the mismatch condition than for the match condition around 300-500 ms before the onset of the picture name. Our findings indicate that comprehension-elicited prediction influences speech-motor production, suggesting that the speech production system is implicated in the representation of such predictions.
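
    As a rough illustration of the kind of frame-by-frame change measure described above, the sketch below computes the mean absolute intensity change between consecutive ultrasound frames and averages it over a window defined relative to acoustic onset. The function names, the frame-rate handling, and the pixel-based measure are assumptions for illustration; the study's actual analysis may differ.

```python
import numpy as np

def frame_by_frame_change(frames: np.ndarray) -> np.ndarray:
    """Mean absolute intensity change between consecutive ultrasound frames.

    frames: array of shape (n_frames, height, width), grayscale.
    Returns an array of length n_frames - 1.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diffs.mean(axis=(1, 2))

def change_in_window(frames, fps, onset_s, start_ms=-500, end_ms=-300):
    """Average change in a window relative to acoustic onset (e.g. -500..-300 ms).

    onset_s: time of the acoustic onset of the picture name, in seconds.
    """
    change = frame_by_frame_change(frames)
    lo = int((onset_s + start_ms / 1000.0) * fps)
    hi = int((onset_s + end_ms / 1000.0) * fps)
    return change[max(lo, 0):hi].mean()
```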

    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Current Automatic Speech Recognition (ASR) systems fall well short of human speech recognition performance due to their lack of robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as 'beads-on-a-string', where the beads are the individual phone units. While phone units are distinct in the cognitive domain, they vary in the physical domain; this variation arises from a combination of factors, including speaking style and speaking rate, a phenomenon commonly known as 'coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain proof-of-concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task from the speech signal. Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated, where the former was used to train a TV-estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, hence eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of an ASR system.
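
    To make the two-stream observation setup concrete, here is a minimal sketch of how acoustic features and estimated TVs might be bundled for such a DBN. librosa is used for MFCC extraction; `estimate_tvs` is a placeholder for the trained TV-estimator described in the dissertation, which is not reproduced here, and the front-end settings are common defaults rather than the dissertation's configuration.

```python
import numpy as np
import librosa

def mfcc_stream(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """First observation stream: MFCCs from the acoustic signal."""
    y, sr = librosa.load(wav_path, sr=16000)
    # 25 ms windows with a 10 ms hop -- a common ASR front-end configuration
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160).T  # (frames, n_mfcc)

def estimate_tvs(mfccs: np.ndarray) -> np.ndarray:
    """Second stream: tract-variable (TV) trajectories estimated from acoustics.

    Placeholder for the trained TV-estimator; substitute a real model here.
    """
    raise NotImplementedError("plug in a trained acoustic-to-TV model")

def dbn_observations(wav_path: str) -> dict:
    """Bundle both observation streams; the DBN treats gestures as hidden variables."""
    mfccs = mfcc_stream(wav_path)
    return {"mfcc": mfccs, "tv": estimate_tvs(mfccs)}
```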

    Registration and statistical analysis of the tongue shape during speech production

    This thesis analyzes the human tongue shape during speech production. First, a semi-supervised approach is derived for estimating the tongue shape from volumetric magnetic resonance imaging data of the human vocal tract. Results of this extraction are used to derive parametric tongue models. Next, a framework is presented for registering sparse motion capture data of the tongue by means of such a model. This method makes it possible to generate full three-dimensional animations of the tongue. Finally, a multimodal and statistical text-to-speech system is developed that is able to synthesize audio and synchronized tongue motion from text. Funding: German Research Foundation.
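
    Parametric shape models of this kind are commonly built with principal component analysis over corresponding surface points. The sketch below shows that general construction under the assumption that the extracted tongue shapes are already in point-to-point correspondence; it is illustrative and not the thesis's actual model.

```python
import numpy as np

def fit_tongue_model(shapes: np.ndarray, n_modes: int = 5):
    """Fit a PCA shape model.

    shapes: (n_samples, n_points * 3) flattened tongue surfaces, assumed
    to be in point-to-point correspondence across samples.
    Returns the mean shape, the principal modes, and per-mode std devs.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # SVD of the centered data yields the principal deformation modes
    _, svals, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_modes], svals[:n_modes] / np.sqrt(len(shapes) - 1)

def synthesize(mean, modes, weights):
    """Generate a tongue shape from low-dimensional model parameters."""
    return mean + np.asarray(weights) @ modes
```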

    The selective use of gaze in automatic speech recognition

    The performance of automatic speech recognition (ASR) degrades significantly in natural environments compared to laboratory assessments. As a major source of interference, acoustic noise affects speech intelligibility during the ASR process. Acoustic noise causes two main problems. The first is contamination of the speech signal. The second is changes in speakers' vocal and non-vocal behaviour. These phenomena create a mismatch between ASR training and recognition conditions, which leads to considerable performance degradation. To improve noise-robustness, popular approaches exploit prior knowledge of the acoustic noise in speech enhancement, feature extraction and recognition models. An alternative approach, presented in this thesis, is to introduce eye gaze as an extra modality. Eye gaze behaviours play a role in interaction and carry information about cognition and visual attention; not all such behaviours are relevant to speech. Therefore, gaze behaviours are used selectively to improve ASR performance. This is achieved by inference procedures using noise-dependent models of gaze behaviours and their temporal and semantic relationship with speech. 'Selective gaze-contingent ASR' systems are proposed and evaluated on a corpus of eye movement and related speech recorded in clean and noisy environments. The best-performing systems utilise both acoustic and language model adaptation.
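
    One plausible reading of the 'selective' use of gaze is to gate gaze-derived evidence by a noise-dependent weight when rescoring recognition hypotheses. The sketch below is schematic: the logistic weighting, the SNR reference point, and the gaze word-score representation are all invented for illustration, not taken from the thesis.

```python
import math

def rescore(asr_hypotheses, gaze_word_scores, snr_db):
    """Interpolate acoustic scores with gaze-derived word scores.

    asr_hypotheses: list of (word_sequence, acoustic_log_prob) pairs.
    gaze_word_scores: dict mapping words to log-probabilities derived
        from fixated objects (an illustrative representation).
    snr_db: estimated signal-to-noise ratio; lower SNR -> trust gaze more.
    """
    # Noise-dependent interpolation weight in [0, 1] (assumed logistic form)
    lam = 1.0 / (1.0 + math.exp((snr_db - 10.0) / 5.0))

    def gaze_score(words):
        return sum(gaze_word_scores.get(w, math.log(1e-6)) for w in words)

    return max(asr_hypotheses,
               key=lambda h: (1 - lam) * h[1] + lam * gaze_score(h[0]))
```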

    Involvement of the speech production system in prediction during comprehension: an articulatory imaging investigation

    This thesis investigates the effects in speech production of prediction during speech comprehension. The topic is raised by recent theoretical models of speech comprehension, which suggest a more integrated role for speech production and comprehension mechanisms than has previously been posited. The thesis is specifically concerned with the suggestion that during speech comprehension upcoming input is simulated with reference to the listener’s own speech production system by way of efference copy. Throughout this thesis the approach taken is to investigate whether representations elicited during comprehension impact speech production. The representations of interest are those generated endogenously by the listener during prediction of upcoming input. We investigate whether predictions are represented at a form level within the listener’s speech production system. We first present an overview of the relevant literature. We then present details of a picture-word interference study undertaken to confirm that the item set employed elicits typical phonological effects within a conventional paradigm in which the competing representation is perceptually available. The main body of the thesis presents evidence concerning the nature of representations arising during prediction, specifically their effect on speech output. We first present evidence from picture-naming vocal response latencies. We then complement and extend this with evidence from articulatory imaging, allowing an examination of pre-acoustic aspects of speech production. To investigate effects on speech production as a dynamic motor activity, we employ the Delta method, developed to quantify articulatory variability from EPG and ultrasound recordings. We apply this technique to ultrasound data acquired during mid-sagittal imaging of the tongue and extend the approach to explore the time-course of articulation during the acoustic response latency period. We investigate whether prediction of another’s speech evokes articulatorily specified activation within the listener’s speech production system. The findings presented in this thesis suggest that representations evoked as predictions during speech comprehension do affect speech motor output. However, we found no evidence to suggest that predictions are represented in an articulatorily specified manner. We discuss this conclusion with reference to models of speech production-perception that implicate efference copies in the generation of predictions during speech comprehension.
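
    As a concrete, assumed reading of this contour-based articulatory change analysis, the sketch below measures mean point displacement between consecutive tracked tongue contours and averages the resulting series over trials. It presumes contour points correspond across frames (e.g., sampled along fixed fan lines) and trials of equal length after alignment to acoustic onset; it is an illustration rather than the published Delta implementation.

```python
import numpy as np

def delta_series(contours: np.ndarray) -> np.ndarray:
    """Frame-to-frame articulatory change for one trial.

    contours: (n_frames, n_points, 2) mid-sagittal tongue contours, with
    point i corresponding across frames (e.g. fixed fan-line angles).
    Returns mean Euclidean point displacement per frame transition.
    """
    step = np.diff(contours, axis=0)                  # (n_frames-1, n_points, 2)
    return np.linalg.norm(step, axis=2).mean(axis=1)  # (n_frames-1,)

def condition_profile(trials):
    """Average the change series over trials (assumed equal length after
    alignment to acoustic onset) to compare match vs. mismatch conditions."""
    return np.mean([delta_series(t) for t in trials], axis=0)
```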
