
    Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates

    Despite years of speech recognition research, little is known about which words tend to be misrecognized and why. Previous work has shown that errors increase for infrequent words, short words, and very loud or fast speech, but many other presumed causes of error (e.g., nearby disfluencies, turn-initial words, phonetic neighborhood density) have never been carefully tested. The reasons for the huge differences found in error rates between speakers also remain largely mysterious. Using a mixed-effects regression model, we investigate these and other factors by analyzing the errors of two state-of-the-art recognizers on conversational speech. Words with higher error rates include those with extreme prosodic characteristics, those occurring turn-initially or as discourse markers, and doubly confusable pairs: acoustically similar words that also have similar language model probabilities. Words preceding disfluent interruption points (first repetition tokens and words before fragments) also have higher error rates. Finally, even after accounting for other factors, speaker differences cause enormous variance in error rates, suggesting that speaker error rate variance is not fully explained by differences in word choice, fluency, or prosodic characteristics. We also propose that doubly confusable pairs, rather than high neighborhood density, may better explain phonetic neighborhood errors in human speech processing.
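    As a rough illustration of the modeling approach named above, here is a minimal mixed-effects regression sketch in Python with statsmodels, assuming a hypothetical per-word table (column names such as is_error, log_freq, duration_z, turn_initial, and speaker are illustrative, not from the paper):

        import pandas as pd
        import statsmodels.formula.api as smf

        # Hypothetical data: one row per word token, with a binary error
        # outcome and word-level predictors (names are illustrative).
        words = pd.read_csv("word_errors.csv")

        # Linear mixed-effects model with a random intercept per speaker,
        # separating speaker-level variance from word-level factors.
        model = smf.mixedlm(
            "is_error ~ log_freq + duration_z + turn_initial",
            data=words,
            groups=words["speaker"],
        )
        result = model.fit()
        print(result.summary())

    The variance of the speaker random intercept then quantifies how much error rates still differ across speakers once the fixed effects are accounted for.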

    Speech Errors Produced by EFL Learners of Islamic Boarding School in Telling English Story

    The students of Islamic Boarding School Nurul Islam regularly practice spoken English, but they still make errors when speaking. They often produce speech errors when holding an English conversation or when taking a turn to speak in front of the class. This study aims to investigate the existence and frequency of speech errors, especially silent pauses and filled pauses, produced by the students of Islamic Boarding School Nurul Islam in telling an English story. The research is descriptive-qualitative, with data presented in statistical form. The object of the research is the speech errors produced by students in telling an English story, and the respondents are 30 students from the 8th grade of the English Tutorial Program at Islamic Boarding School Nurul Islam in the academic year 2016/2017. The research was conducted by observation, to investigate the existence of silent pauses and filled pauses produced by the students in telling an English story and to determine the percentage of each speech error type. The findings show 603 speech errors produced by the students: 524 silent pauses (87%) and 79 filled pauses (13%).
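    The reported shares follow directly from the raw counts; a quick check in Python (counts taken from the abstract):

        # Counts reported in the abstract.
        silent_pause = 524
        filled_pause = 79
        total = silent_pause + filled_pause  # 603

        print(f"Silent pause: {silent_pause / total:.0%}")  # 87%
        print(f"Filled pause: {filled_pause / total:.0%}")  # 13%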

    Prosodic Classification of Discourse Markers

    The first contribution of this study is the description of the prosodic behavior of discourse markers present in two speech corpora of European Portuguese (EP) in different domains (university lectures, and map-task dialogues). The second contribution is a multiclass classification to verify, given their prosodic features, which words in both corpora are classified as discourse markers, which are disfluencies, and which correspond to words that are neither markers nor disfluencies (chunks). Our goal is to automatically predict discourse markers and include them in rich transcripts, along with other structural metadata events (e.g., disfluencies and punctuation marks) that are already encompassed in the language models of our in-house speech recognizer. Results show that the automatic classification of discourse markers is better for the lectures corpus (87%) than for the dialogue corpus (84%). Nonetheless, in both corpora, discourse markers are more easily confused with chunks than with disfluencies.
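    A minimal sketch of this kind of three-way prosodic classification in Python with scikit-learn, assuming a hypothetical feature table (the feature names, classifier, and file are illustrative; the paper's actual setup is not specified here):

        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import classification_report

        # Hypothetical per-word prosodic features with labels
        # "marker", "disfluency", or "chunk".
        data = pd.read_csv("prosodic_features.csv")
        X = data[["pitch_slope", "energy", "duration", "pause_before"]]
        y = data["label"]

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0
        )

        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X_train, y_train)

        # Per-class precision/recall exposes which classes are confused,
        # e.g. markers misclassified as chunks.
        print(classification_report(y_test, clf.predict(X_test)))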

    Classification of Failures in the Perception of Conversational Agents (CAs) and their Implications on Patient Safety

    The use of conversational agents (CAs) in healthcare is an emerging field. These CAs appear effective at administrative tasks, e.g. providing the locations of care facilities and scheduling appointments. Modern CAs use machine learning (ML) to recognize and understand user input and to generate a response. Given the criticality of many healthcare settings, ML and other component errors may result in CA failures that cause adverse effects on patients. Therefore, in-depth assurance is required before the deployment of ML in critical clinical applications, e.g. management of medication doses or medical diagnosis. CA safety issues can arise from diverse causes, e.g. user interactions, environmental factors, and ML errors. In this paper, we classify failures of perception (recognition and understanding) of CAs and their sources. We also present a case study of a CA used for calculating insulin doses for patients with gestational diabetes mellitus (GDM). We then relate the identified perception failures of CAs to potential scenarios that might compromise patient safety.

    Acoustic-prosodic entrainment in structural metadata events

    This paper presents an acoustic-prosodic analysis of entrainment in a Portuguese map-task corpus. Our aim is to analyze how turn-by-turn entrainment varies with distinct structural metadata events: types of sentence-like units (SU) in consecutive turns (e.g. interrogatives followed by declaratives, or both declaratives), and with the presence of discourse markers, affirmative cue words, and disfluencies at the beginning of turns. Entrainment at turn exchanges may be observed in terms of pitch, energy, duration, and voice quality. Regarding SU types, question-answer turns are the ones with stronger similarity, and declarative-interrogative pairs are the ones where less entrainment occurs, as expected. Moreover, in question-answer pairs, there is also stronger evidence of entrainment with Yes/No and Tag questions than with Wh- questions. In fact, these subtypes are coded in distinctive prosodic ways (moreover, the first subtype has no associated lexical-syntactic cues in Portuguese, only prosodic ones). As for turn-initial structures, entrainment is stronger when the second turn begins with an affirmative cue word; less strong with ambiguous structures (such as ‘OK’), emphatic affirmative answers, and negative answers; and scarce with disfluencies and discourse markers. The different degrees of local entrainment may be related to the informative structure of distinct structural metadata events.
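    One simple proximity-style measure of such turn-by-turn entrainment (a sketch assuming one prosodic feature value per turn, not the paper's exact measure):

        import numpy as np

        def local_entrainment(feature_by_turn):
            # Negated absolute difference of a feature (e.g. mean pitch)
            # across consecutive turns: values closer to zero indicate
            # stronger local entrainment at that turn exchange.
            turns = np.asarray(feature_by_turn, dtype=float)
            return -np.abs(np.diff(turns))

        # Illustrative mean-pitch values (Hz) for six consecutive turns.
        print(local_entrainment([210.0, 205.5, 198.0, 230.0, 226.5, 224.0]))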

    Gender Representation in French Broadcast Corpora and Its Impact on ASR Performance

    This paper analyzes gender representation in four major corpora of French broadcast speech. Since these corpora are widely used within the speech processing community, they are primary material for training automatic speech recognition (ASR) systems. As gender bias has been highlighted in numerous natural language processing (NLP) applications, we study the impact of the gender imbalance in TV and radio broadcast data on the performance of an ASR system. The analysis shows that women are under-represented in our data in terms of both speakers and speech turns. We introduce the notion of speaker role to refine our analysis and find that women are even scarcer within the Anchor category, which corresponds to prominent speakers. The disparity in available data for the two genders causes performance to decrease on women's speech. However, this global trend can be counterbalanced for speakers who speak regularly in the media, when a sufficient amount of data is available.
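    A sketch of measuring the performance gap described here, assuming per-utterance reference/hypothesis pairs tagged with speaker gender (the jiwer package computes word error rate; the file and column names are illustrative):

        import pandas as pd
        import jiwer

        # Hypothetical table: one row per utterance, with the reference
        # transcript, the ASR hypothesis, and the speaker's gender.
        utts = pd.read_csv("asr_results.csv")

        # Word error rate per gender group; a higher WER for women would
        # reflect the under-representation described above.
        for gender, group in utts.groupby("gender"):
            wer = jiwer.wer(list(group["reference"]), list(group["hypothesis"]))
            print(f"{gender}: WER = {wer:.1%}")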

    Phoneme recognition and confusions in patients with sensorineural hearing loss

    Hearing-impaired listeners show different phoneme confusions during speech recognition testing. The aim of the study was to analyze phoneme recognition in patients with sensorineural hearing loss during word recognition testing with monosyllabic words, as well as to compare consonant confusions in different vowel contexts. Recognition of 18 initial and final consonants was analyzed in a total of 698 presentations of the words. There were 1154 correct recognitions (82.7%) and 100 consonant confusions (7.2%). The patients did not respond to a total of 71 presentations of the words, which means that in 142 cases (10.2%) the consonants were neither recognized nor confused. There are no consonant confusion patterns during suprathreshold testing with real words. In cases of phoneme confusion, listeners replace the stimulus word with another word from the lexical neighborhood. In terms of the vowel context, consonants are most easily identified in the context of the vowel /a/.
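    The reported proportions can be reconstructed from the counts in the abstract (698 word presentations, each scored for an initial and a final consonant, give 1396 consonant presentations):

        # Counts from the abstract: 698 words x 2 consonants each.
        total = 698 * 2          # 1396 consonant presentations
        correct = 1154
        confused = 100
        no_response = 71 * 2     # 142: both consonants unscored on no response

        for label, n in [("correct", correct), ("confused", confused),
                         ("no response", no_response)]:
            print(f"{label}: {n} ({n / total:.1%})")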