38 research outputs found

    Effect of Visual Input on Vowel Production in English Speakers

    Get PDF
    This study investigates whether a model of speech perception and production should include a visual component by comparing the jaw opening, advancement, and rounding of American English and non-English vowels in the presence and absence of a visual stimulus. Surprisingly, the visual stimulus did not affect jaw opening, but it was found to be a significant factor in participants' vowel advancement for non-English vowels. This may be explained by lip rounding, but further research is required to develop a full understanding of the impact of visual input on vowel production for use in language teaching and learning.
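
    The abstract reports a significant condition effect on vowel advancement. As a rough illustration of how such a comparison might be run (the study's actual statistics are not given in the abstract), the sketch below applies a paired t-test to invented F2 values, using F2 as a common acoustic proxy for vowel advancement; all numbers here are hypothetical.

```python
# Hypothetical sketch: comparing vowel advancement (approximated by F2)
# between audio-only and audio-visual conditions with a paired t-test.
# The data values are invented for illustration only.
from scipy import stats

# F2 (Hz) for one non-English vowel, per participant, in each condition
f2_audio_only = [1450, 1480, 1390, 1510, 1420, 1475]
f2_audio_visual = [1520, 1540, 1460, 1580, 1495, 1550]

t_stat, p_value = stats.ttest_rel(f2_audio_only, f2_audio_visual)
print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")
```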

    Acoustics and Perception of Clear Fricatives

    Get PDF
    Everyday observation indicates that speakers can naturally and spontaneously adopt a speaking style that allows them to be understood more easily when confronted with difficult communicative situations. Previous studies have demonstrated that the resulting speaking style, known as clear speech, is more intelligible than casual, conversational speech for a variety of listener populations. However, few studies have examined the acoustic properties of clearly produced fricatives in detail. In addition, it is unknown whether clear speech improves the intelligibility of fricative consonants, or how its effects on fricative perception might differ depending on listener population. Since fricatives cause a large number of recognition errors both for normal-hearing listeners in adverse conditions and for hearing-impaired listeners, it is of interest to explore these issues with a focus on fricatives. The current study attempts to characterize the type and magnitude of adaptations in the clear production of English fricatives and to determine whether clear speech enhances fricative intelligibility for normal-hearing listeners and listeners with simulated impairment. In an acoustic experiment (Experiment I), ten female and ten male talkers produced nonsense syllables containing the fricatives /f, θ, s, ʃ, v, ð, z, ʒ/ in VCV contexts, in both a conversational style and a clear style that was elicited by means of simulated recognition errors in feedback received from an interactive computer program. Acoustic measurements were taken for spectral, amplitude, and temporal properties known to influence fricative recognition. Results illustrate that (1) there were consistent overall clear speech effects, several of which (consonant duration, spectral peak location, spectral moments) were consistent with previous findings and a few of which (notably consonant-to-vowel intensity ratio) were not, (2) 'contrastive' differences related to acoustic inventory and eliciting prompts were observed in key comparisons, and (3) talkers differed widely in the types and magnitude of acoustic modifications. Two perception experiments using these same productions as stimuli (Experiments II and III) were conducted to address three major questions: (1) whether clearly produced fricatives are more intelligible than conversational fricatives, (2) which specific acoustic modifications are related to clear speech intelligibility advantages, and (3) how sloping, recruiting hearing impairment interacts with clear speech strategies. Both perception experiments used an adaptive procedure to estimate the signal-to-noise ratio (SNR) threshold, in multi-talker babble, at which minimal-pair fricative categorizations could be made with 75% accuracy. Data from fourteen normal-hearing listeners (Experiment II) and fourteen listeners with simulated sloping elevated thresholds and loudness recruitment (Experiment III) indicate that clear fricatives were more intelligible overall for both listener groups. However, for listeners with simulated hearing impairment, a reliable clear speech intelligibility advantage was not found for non-sibilant pairs.
    Correlation analyses comparing acoustic and perceptual style-related differences across the 20 talkers indicated that a shift of energy concentration toward higher frequency regions and greater source strength were primary contributors to the "clear fricative effect" for normal-hearing listeners but not for listeners with simulated loss, for whom information in higher frequency regions was less audible.
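
    Among the acoustic measures named above, spectral moments are straightforward to compute. The following minimal sketch, not taken from the study, derives the first four spectral moments (centroid, variance, skewness, excess kurtosis) from a power spectrum; the synthetic noise input merely stands in for a recorded fricative, and windowing and noise-floor details are simplified.

```python
# Minimal sketch of spectral-moment analysis on a (synthetic) signal.
import numpy as np

def spectral_moments(signal, sr):
    """First four spectral moments of the power spectrum:
    centroid (Hz), variance, skewness, excess kurtosis."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    p = spectrum / spectrum.sum()          # normalize to a probability mass
    centroid = np.sum(freqs * p)
    var = np.sum(((freqs - centroid) ** 2) * p)
    sd = np.sqrt(var)
    skew = np.sum(((freqs - centroid) ** 3) * p) / sd ** 3
    kurt = np.sum(((freqs - centroid) ** 4) * p) / sd ** 4 - 3
    return centroid, var, skew, kurt

sr = 16000
n = int(0.05 * sr)                         # 50 ms analysis window
noise = np.random.randn(n)                 # stand-in for a fricative noise source
print(spectral_moments(noise, sr))
```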

    Automatic User-Adaptive Speaking Rate Selection

    Full text link

    Automation of the Spoken Poetry Rhyming Game in Persian

    Get PDF
    This paper investigates how a Persian spoken poetry game, called Mosha'ere, can be computerized using a Persian automatic speech recognition system trained on read speech. To this end, the texts and recited speech of poems by the great poets Hafez and Sa'di were gathered, and a spoken poetry rhyming game called Chakame was developed. It uses context-dependent tri-phone HMM acoustic models, trained on Persian read speech at a normal rate, to recognize beyts, i.e., lines of verse, spoken by a human user. Chakame was evaluated on two kinds of recitation speech: 100 beyts recited formally at a normal rate and another 100 beyts recited emotionally, hyperarticulated and at a slow rate. A difference of about 23% in WER shows the impact of the intrinsic features of emotional verse recitation on the recognition rate. Nevertheless, an overall beyt recognition rate of 98.5% was obtained for Chakame.
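
    The roughly 23% gap reported above is measured in word error rate. As an illustration of the metric itself (not of Chakame's internals), the sketch below computes WER as length-normalized Levenshtein distance between a reference and a recognized word sequence; the transliterated example words are invented.

```python
# Illustrative sketch: word error rate via Levenshtein distance,
# the metric behind the ~23% WER gap reported in the abstract.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Toy example with transliterated words standing in for a Persian beyt
print(wer("del miravad ze dastam", "del miravad ze dast"))  # 0.25
```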

    SPA: Web-based platform for easy access to speech processing modules

    Get PDF
    This paper presents SPA, a web-based Speech Analytics platform that integrates several speech processing modules and makes it possible to use them through the web. It was developed with the aim of facilitating the use of the modules without requiring knowledge of software dependencies and specific configurations. Apart from being accessible through a web browser, the platform also provides a REST API for easy integration with other applications. The platform is flexible and scalable, provides authentication for access restrictions, and was developed with the time and effort of adding new services in mind. The platform is still being improved, but it already integrates a considerable number of audio and text processing modules, including: automatic transcription, speech disfluency classification, emotion detection, dialog act recognition, age and gender classification, non-nativeness detection, hyperarticulation detection, and two external modules for feature extraction and DTMF detection. This paper describes the SPA architecture, presents the already integrated modules, and provides a detailed description of the most recently integrated ones.
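
    The abstract mentions a REST API but gives no endpoint details, so the following client sketch is entirely hypothetical: the server URL, route, and JSON field names are invented to show the kind of integration such an API enables.

```python
# Hypothetical client sketch for a platform like SPA's REST API.
# The endpoint URL, route, and field names below are invented for
# illustration and are not taken from the paper.
import requests

def transcribe(audio_path: str, server: str = "http://spa.example.org/api") -> str:
    """Upload an audio file to a (hypothetical) transcription route
    and return the recognized text."""
    with open(audio_path, "rb") as f:
        response = requests.post(f"{server}/transcribe", files={"audio": f})
    response.raise_for_status()
    return response.json()["transcript"]  # assumed response schema

if __name__ == "__main__":
    print(transcribe("example.wav"))
```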

    Efficient error correction for speech systems using constrained re-recognition

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 71-75).
    Efficient error correction of recognition output is a major barrier to the adoption of speech interfaces. This thesis addresses this problem through a novel correction framework and user interface. The system uses constraints provided by the user to enhance re-recognition, correcting errors with minimal user effort and time. In our web interface, users listen to the recognized utterance, marking incorrect words as they hear them. After they have finished marking errors, they submit the edits back to the speech recognizer, where they are merged with previous edits and converted into a finite state transducer (FST). This FST, which models the regions of correct and incorrect words in the recognition output, is then composed with the recognizer's language model, and the utterance is re-recognized. We explored the use of our error correction technique in both the lecture and restaurant domains, evaluating the types of errors and the correction performance in each. With our system, we found significant improvements over other error correction techniques such as n-best lists, re-speaking or verbal corrections, and retyping, in terms of actions per correction step, corrected output rate, and ease of use.
    by Gregory T. Yu. M.Eng.
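
    The correction pipeline above pins user-confirmed words in place and frees marked-error regions for re-recognition. The toy sketch below mimics that constraint with a regular expression that filters candidate hypotheses; a real system, like the one described, would instead compile the constraint into an FST and compose it with the language model. All example data are invented.

```python
# Toy sketch of the constraint idea (not the thesis's FST code):
# user-confirmed words must match exactly, while words the user marked
# wrong become wildcard slots that any single word can fill.
import re

def build_constraint(words, wrong_flags):
    """Build a pattern: marked-wrong words match any word, others match exactly."""
    parts = [r"\S+" if wrong else re.escape(w)
             for w, wrong in zip(words, wrong_flags)]
    return re.compile(r"^" + r" ".join(parts) + r"$")

recognized = ["show", "me", "cheap", "restaurants", "in", "Austin"]
wrong = [False, False, False, False, False, True]  # user marked "Austin"
pattern = build_constraint(recognized, wrong)

candidates = ["show me cheap restaurants in Boston",
              "show me the restaurants in Boston"]
print([c for c in candidates if pattern.match(c)])
# keeps only the hypothesis consistent with the confirmed words
```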

    The Science and Art of Voice Interfaces

    Get PDF

    Compensating hyperarticulation for automatic speech recognition

    Get PDF

    Statistical distributions of consonant variants in infant-directed speech: evidence that /t/ may be exceptional

    Get PDF
    Statistical distributions of phonetic variants in spoken language influence speech perception for both language learners and mature users. We theorized that patterns of phonetic variant processing of consonants demonstrated by adults might stem in part from patterns of early exposure to statistics of phonetic variants in infant-directed (ID) speech. In particular, we hypothesized that ID speech might involve greater proportions of canonical /t/ pronunciations compared to adult-directed (AD) speech in at least some phonological contexts. This possibility was tested using a corpus of spontaneous speech of mothers speaking to other adults, or to their typically-developing infant. Tokens of word-final alveolar stops – including /t/, /d/, and the nasal stop /n/ – were examined in assimilable contexts (i.e., those followed by a word-initial labial and/or velar); these were classified as canonical, assimilated, deleted, or glottalized. Results confirmed that there were significantly more canonical pronunciations in assimilable contexts in ID compared with AD speech, an effect which was driven by the phoneme /t/. These findings suggest that at least in phonological contexts involving possible assimilation, children are exposed to more canonical /t/ variant pronunciations than adults are. This raises the possibility that perceptual processing of canonical /t/ may be partly attributable to exposure to canonical /t/ variants in ID speech. Results support the need for further research into how statistics of variant pronunciations in early language input may shape speech processing across the lifespan.
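
    The core comparison is a proportion of canonical versus non-canonical variants across registers. As a schematic illustration (the study's actual counts and statistical model are not given in the abstract), the sketch below runs a chi-square test on invented token counts.

```python
# Invented-numbers sketch of the comparison the study reports: the rate
# of canonical /t/ in assimilable contexts in infant-directed (ID)
# versus adult-directed (AD) speech, tested with a chi-square test.
from scipy.stats import chi2_contingency

#        canonical  non-canonical (assimilated/deleted/glottalized)
table = [[60, 40],   # ID speech  (hypothetical token counts)
         [35, 65]]   # AD speech  (hypothetical token counts)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```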