
    Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process.

    An End-Of-Turn Detection Module (EOTD-M) is an essential component of automatic Spoken Dialogue Systems. The capability of correctly detecting whether a user’s utterance has ended or not improves the accuracy in interpreting the meaning of the message and decreases the latency in the answer. Usually, in dialogue systems, an EOTD-M is coupled with an Automatic Speech Recognition Module (ASR-M) to transmit complete utterances to the Natural Language Understanding unit. Mistakes in the ASR-M transcription can have a strong effect on the performance of the EOTD-M. The actual extent of this effect depends on the particular combination of ASR-M transcription errors and the sentence featurization techniques implemented as part of the EOTD-M. In this paper we investigate this important relationship for an EOTD-M based on semantic information and particular characteristics of the speakers (speech profiles). We introduce an Automatic Speech Recognition Simulator (ASR-SIM) that models different types of semantic mistakes in the ASR-M transcription as well as different speech profiles. We use the simulator to evaluate the sensitivity to ASR-M mistakes of a Long Short-Term Memory network classifier trained in EOTD with different featurization techniques. Our experiments reveal the different ways in which the performance of the model is influenced by the ASR-M errors. We corroborate that not only is the ASR-SIM useful to estimate the performance of an EOTD-M in customized noisy scenarios, but it can also be used to generate training datasets with the expected error rates of real working conditions, which leads to better performance. EMPATHIC IT1244-19 TIN2016-78365-R PID2019-104966GB-I00

    Detecting disfluency in spontaneous speech


    Feature extraction and event detection for automatic speech recognition


    Advancing Electromyographic Continuous Speech Recognition: Signal Preprocessing and Modeling

    Speech is the natural medium of human communication, but audible speech can be overheard by bystanders and excludes speech-disabled people. This work presents a speech recognizer based on surface electromyography, where electric potentials of the facial muscles are captured by surface electrodes, allowing speech to be processed nonacoustically. A system which was state-of-the-art at the beginning of this book is substantially improved in terms of accuracy, flexibility, and robustness.

    Audiovisual prosody in interaction


    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Is cue-based memory retrieval 'good-enough'?: Agreement, comprehension, and implicit prosody in native and bilingual speakers of English

    This dissertation focuses on structural and prosodic effects during reading, examining their influence on agreement processing and comprehension in native English (L1) and Spanish-English bilingual (L2) speakers. I consolidate research from three distinct areas of inquiry (cognitive processing models, development of reading fluency, and L1/L2 processing strategies) and outline a cohesive and comprehensive processing model that can be applied to speakers regardless of language profile. This model is characterized by three critical components: a cognitive model of memory retrieval, a processing paradigm that outlines how resources may be deployed online, and the role of factors such as prosody in parsing decisions. The general framework of this integrated 'Good-enough Cue' (GC) model assumes the 'Good-Enough' Hypothesis and cue-based memory retrieval as central aspects. The 'Good-Enough' Hypothesis states that all speakers have access to two processing routes: a complete syntactic route, and a 'good enough' heuristic route (Ferreira, Bailey, & Ferraro, 2002; Ferreira, 2003). In the interest of conserving resources, speakers tend to rely more on heuristics and templates whenever the task allows, and may be required to rely on this fallback route when task demand is high. In the proposed GC model, cue-based memory retrieval (CBMR) is the instantiation of the complete syntactic route for agreement and long-distance dependencies in particular (Lewis & Vasishth, 2005; Wagers, Lau, & Phillips, 2009; Wagers, 2008). When retrieval fails using CBMR (due to cue overlap, memory trace decay, or some other factor), comprehenders may compensate by applying a 'good-enough' processing heuristic, which prioritizes general comprehension over detailed syntactic computation.
Prosody (or implicit prosody) may reduce processing load by either facilitating syntactic processing or otherwise assisting memory retrieval, thus reducing reliance on the good-enough fallback route. This investigation explores how text presentation format interacts with these algorithmic versus heuristic processing strategies. Most specifically, it measures whether the presentation format of text affects readers' comprehension and ability to detect subject-verb agreement errors in simple and complex relative clause constructions. The experimental design manipulated text presentation to influence implicit prosody, using sentences designed to induce subject-verb agreement attraction errors. Materials included simple and embedded relative clauses with head nouns and verbs that were either matched or mismatched for number. Participants read items in one of three presentation formats: a) whole sentence, b) word-by-word, or c) phrase-by-phrase, and rated each item for grammaticality and responded to a comprehension probe. Results indicate that while overall comprehension is typically prioritized over grammatical processing (following the 'Good-Enough' Hypothesis), the effects of presentation format are differentially influential based on group differences and processing measure. For the L1 participants, facilitating the projection of phrasal prosody (phrase-by-phrase presentation) onto text enhances performance in syntactic and grammatical processing, while disrupting it via a word-by-word presentation decreases comprehension accuracy. For the L2 participants, however, phrase-by-phrase presentation is not significantly beneficial for grammatical processing, even resulting in a decrease in comprehension accuracy.
These differences provide insight into the interaction of cognitive taskload, processing strategy selection, and the role of implicit prosody in reading fluency, building toward a comprehensive processing model for speakers of varying language profiles and proficiencies.