    Cognitive Phonetics: The Transduction of Distinctive Features at the Phonology-Phonetics Interface

    We propose that the interface between phonology and phonetics is mediated by a transduction process that converts elementary units of phonological computation, features, into temporally coordinated neuromuscular patterns, called ‘True Phonetic Representations’, which are directly interpretable by the motor system of speech production. Our view of the interface is constrained by substance-free generative phonological assumptions and by insights gained from psycholinguistic and phonetic models of speech production. To distinguish transduction of abstract phonological units into planned neuromuscular patterns from the biomechanics of speech production usually associated with physiological phonetics, we have termed this interface theory ‘Cognitive Phonetics’ (CP). The inner workings of CP are described in terms of Marr’s (1982/2010) tri-level approach, which we used to construct a linking hypothesis relating formal phonology to neurobiological activity. Potential neurobiological correlates supporting various parts of CP are presented. We also argue that CP augments the study of certain phonetic phenomena, most notably coarticulation, and suggest that some phenomena usually considered phonological (e.g., naturalness and gradience) receive better explanations within CP.
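
    The transduction step can be pictured, very roughly, as a mapping from bundles of distinctive features to temporally coordinated articulator activations. The sketch below is purely illustrative and not drawn from the paper: the feature labels, articulator names, timing values, and the MotorCommand structure are invented placeholders standing in for the much richer ‘True Phonetic Representations’ proposed there.

```python
# Purely illustrative sketch of a feature-to-motor-pattern transduction step.
# Feature labels, articulators, and timing values are invented placeholders,
# not the representations proposed in the paper.
from dataclasses import dataclass

@dataclass
class MotorCommand:
    articulator: str    # e.g. "lips", "velum", "glottis"
    onset_ms: float     # planned onset relative to the segment start
    duration_ms: float  # planned duration of the activation
    target: float       # abstract activation/constriction target (0..1)

# Toy lookup: each distinctive feature contributes one or more temporally
# specified motor commands; a real transduction would be far richer.
FEATURE_TO_COMMANDS = {
    "+labial": [MotorCommand("lips", 0.0, 80.0, 0.9)],
    "+voice":  [MotorCommand("glottis", -10.0, 100.0, 0.7)],
    "+nasal":  [MotorCommand("velum", -20.0, 120.0, 1.0)],
}

def transduce(feature_bundle):
    """Map a bundle of phonological features to a temporally ordered motor plan."""
    commands = []
    for feature in feature_bundle:
        commands.extend(FEATURE_TO_COMMANDS.get(feature, []))
    return sorted(commands, key=lambda c: c.onset_ms)

# A hypothetical /m/-like bundle: labial, voiced, nasal.
for cmd in transduce(["+labial", "+voice", "+nasal"]):
    print(cmd)
```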

    The Role of Prosodic Stress and Speech Perturbation on the Temporal Synchronization of Speech and Deictic Gestures

    Gestures and speech converge during spoken language production. Although the temporal relationship between gestures and speech is thought to depend upon factors such as prosodic stress and word onset, the effects of controlled alterations in the speech signal upon the degree of synchrony between manual gestures and speech are uncertain. Thus, the precise nature of the interactive mechanism of speech-gesture production, or lack thereof, is not agreed upon or even frequently postulated. In Experiment 1, syllable position and contrastive stress were manipulated during sentence production to investigate the synchronization of speech and pointing gestures. Experiment 2 additionally investigated the temporal relationship of speech and pointing gestures when speech was perturbed with delayed auditory feedback (DAF). Comparisons between the time of gesture apex and vowel midpoint (GA-VM) were made for each condition in both Experiment 1 and Experiment 2. Additional comparisons of the interval from gesture launch midpoint to vowel midpoint (GLM-VM), total gesture time, gesture launch time, and gesture return time were made for Experiment 2. The results of the first experiment indicated that gestures were more synchronized with first-position syllables and neutral syllables, as measured by GA-VM intervals. The first-position syllable effect was also found in the second experiment. However, the results from Experiment 2 also supported an effect of contrastive pitch accent: GLM-VM was shorter for first-position targets and accented syllables. In addition, gesture launch times and total gesture times were longer for contrastively pitch-accented syllables, especially in the second position of words. Contrary to predictions, significantly longer GA-VM and GLM-VM intervals were observed when individuals responded under DAF. Vowel and sentence durations increased both with DAF and when a contrastively accented syllable was produced. Vowels were longest for accented, second-position syllables. These findings provide evidence that the timing of gesture is adjusted in response to manipulations of the speech stream. A potential mechanism of entrainment of the speech and gesture systems is offered as an explanation for the observed effects.
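
    The key dependent measure, the interval between gesture apex and vowel midpoint (GA-VM), reduces to a simple subtraction once both events are time-stamped. The snippet below is a minimal sketch of that computation and of the related GLM-VM measure; the field names and example timestamps are hypothetical and are not the authors' analysis code.

```python
# Minimal sketch of the GA-VM and GLM-VM interval measures described above.
# Field names and timestamps are hypothetical, not taken from the study.

def interval_ms(event_time_s, vowel_midpoint_s):
    """Signed interval (ms) from an event to the vowel midpoint.
    Negative values mean the event occurred after the vowel midpoint."""
    return (vowel_midpoint_s - event_time_s) * 1000.0

# One hypothetical trial: a pointing gesture produced with a target word.
trial = {
    "gesture_launch_midpoint_s": 1.180,  # midpoint of the launch phase
    "gesture_apex_s": 1.320,             # maximal extension of the point
    "vowel_midpoint_s": 1.355,           # midpoint of the stressed vowel
}

ga_vm = interval_ms(trial["gesture_apex_s"], trial["vowel_midpoint_s"])
glm_vm = interval_ms(trial["gesture_launch_midpoint_s"], trial["vowel_midpoint_s"])

print(f"GA-VM:  {ga_vm:.0f} ms")   # smaller magnitude = tighter synchrony
print(f"GLM-VM: {glm_vm:.0f} ms")
```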

    Articulatory Information for Robust Speech Recognition

    Current Automatic Speech Recognition (ASR) systems fall well short of human speech recognition performance due to their lack of robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as ‘beads on a string’, where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and this variation arises from a combination of factors including speech style and speaking rate, a phenomenon commonly known as ‘coarticulation’. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task from the speech signal. Presently, no natural speech database contains articulatory gesture annotation; hence, an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs estimated from the acoustic speech signal. In this setup the articulatory gestures were modeled as hidden random variables, eliminating the need for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR systems.
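
    A minimal sketch of the two observation streams fed to the DBN-based recognizer is given below: per-frame MFCCs alongside estimated tract-variable (TV) trajectories. The TV estimator here is a random stand-in, and the tract-variable labels, frame counts, and dimensionalities are assumptions for illustration, not the dissertation's actual front end.

```python
# Sketch of the two observation streams used by the DBN-based recognizer:
# per-frame MFCCs plus estimated tract-variable (TV) trajectories. The TV
# estimator, labels, and frame counts are stand-ins, not the actual system.
import numpy as np

N_FRAMES = 200       # hypothetical number of 10 ms analysis frames
N_MFCC = 13          # typical MFCC dimensionality
TRACT_VARIABLES = [  # constriction-based tract variables (labels illustrative)
    "LA", "LP", "TTCL", "TTCD", "TBCL", "TBCD", "VEL", "GLO",
]

def estimate_tvs(mfcc_frames):
    """Placeholder for the trained TV estimator: maps acoustic frames to
    constriction trajectories. Here it just returns random values."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((mfcc_frames.shape[0], len(TRACT_VARIABLES)))

# Pretend these MFCCs came from an acoustic front end.
mfccs = np.random.default_rng(1).standard_normal((N_FRAMES, N_MFCC))
tvs = estimate_tvs(mfccs)

# The DBN conditions its hidden gesture variables on both streams; a simple
# stand-in is to concatenate them into one per-frame observation vector.
observations = np.concatenate([mfccs, tvs], axis=1)
print(observations.shape)  # (200, 21): 13 MFCCs + 8 TVs per frame
```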

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances in terms of both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features. The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural network based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and made the estimated articulatory trajectories more consistent with VTL's articulatory preferences, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural network based ACS systems trained on German data could be generalized to utterances of other languages.
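
    The analysis-by-synthesis objective used by the genetic algorithm can be sketched as a cosine distance between natural and synthetic acoustic features plus a penalty keeping articulatory parameters within reasonable bounds. The snippet below is such a sketch under assumed interfaces; the synthesize callable, parameter bounds, and feature shapes are placeholders and do not reflect the real VocalTractLab API.

```python
# Sketch of the analysis-by-synthesis objective described above: cosine
# distance between natural and synthetic acoustic features plus a penalty
# keeping articulatory parameters in a plausible range. The synthesizer and
# parameter bounds are placeholders, not the real VocalTractLab interface.
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two flattened feature matrices."""
    a, b = a.ravel(), b.ravel()
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def out_of_range_penalty(params, lower, upper, weight=0.1):
    """Quadratic penalty on articulatory parameters outside [lower, upper]."""
    below = np.clip(lower - params, 0.0, None)
    above = np.clip(params - upper, 0.0, None)
    return weight * float(np.sum(below ** 2 + above ** 2))

def fitness(params, natural_features, synthesize, lower, upper):
    """Objective a genetic algorithm would minimize for one gestural score."""
    synthetic_features = synthesize(params)  # placeholder synthesis call
    return (cosine_distance(natural_features, synthetic_features)
            + out_of_range_penalty(params, lower, upper))

# Toy usage with random "features" and a fake synthesizer.
rng = np.random.default_rng(0)
natural = rng.standard_normal((50, 26))      # e.g. 50 frames of 26-dim features

def fake_synth(params):
    """Stand-in for VTL synthesis plus feature extraction (not a real API)."""
    return natural + 0.01 * params.sum()

params = rng.uniform(-1.0, 1.0, size=20)
print(fitness(params, natural, fake_synth, lower=-1.0, upper=1.0))
```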

    The co-development of manual and vocal activity in infants

    Manual and vocal actions in humans are coupled throughout the lifespan, from the anticipatory opening of the mouth as the hand moves to meet it in neonatal development to the more sophisticated co-expressive gestures of the proficient communicator (Iverson & Thelen, 1999). By adulthood, the systems supporting speech and the manual actions of gesture are so wholly integrated that the expression of both actions together is seamless and effortless (Gentilucci & Nicoladis, 2008). Both systems, though controlled by different muscles moving different articulators, exhibit parallels in their development and organization (Meier & Willerman, 1995). The manual control supporting gesture emerges earlier than the vocal control supporting speech (Ejiri & Masataka, 2001), and the actions of the hands and arms may encourage the organization and patterning of vocal control (Iverson & Fagan, 2004). However, no research has yet characterized the nature of this manual development in the context of vocal development. This study investigates the emergence and practice of manual configurations during vocal and linguistic development in eight typically developing infants. By observing the manual system only during vocal actions, while the participants progressed through babble but before referential word use, this study demonstrates the nature of the coupling between these systems before it is structured by language. These results illustrate the unique coupling of the vocal and motor systems and demonstrate the existence of manual configurations analogous to the practiced vocal patterns that support the development of language.