6 research outputs found

    Adaptation of Whisper models to child speech recognition

    Full text link
    Automatic Speech Recognition (ASR) systems often struggle to transcribe child speech because the large child speech datasets required to train child-friendly ASR models are scarce. However, large amounts of annotated adult speech have been used to create multilingual ASR models such as Whisper. Our work explores whether such models can be adapted to child speech to improve ASR for children. In addition, we compare child-adapted Whisper models with finetuned self-supervised models such as wav2vec2. We demonstrate that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech compared to non-finetuned Whisper models. Additionally, self-supervised wav2vec2 models finetuned on child speech outperform finetuned Whisper models. Comment: Accepted in Interspeech 202
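    The improvements reported above would typically be measured as word error rate (WER). As an illustrative sketch (not the paper's evaluation code), WER can be computed as word-level edit distance normalised by reference length:

    ```python
    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance / number of reference words."""
        ref = reference.split()
        hyp = hypothesis.split()
        # dp[i][j] = min edits to turn the first i ref words into the first j hyp words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                deletion = dp[i - 1][j] + 1
                insertion = dp[i][j - 1] + 1
                dp[i][j] = min(substitution, deletion, insertion)
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)
    ```

    For example, `wer("the cat sat", "the cat sit")` counts one substitution over three reference words, giving roughly 0.33. Libraries such as jiwer offer the same metric with additional normalisation options.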

    Improvement of automatic speech recognition skills of linguistics students through using Ukrainian-English and Ukrainian-German subtitles in publicistic movies

    Get PDF
    The world's increased attention to foreign language study facilitates the development and improvement of language teaching systems in higher education institutions. Such a system takes into account, and promptly responds to, the demands of today's multicultural society. This should start with the transformation and modernization of the higher education system, including the introduction of innovative technologies in the study of English and German that are focused on the modern demands of the world labor market. All this determines the relevance of the research. This article aims to establish ways for students to gain automatic speech recognition skills through subtitling Ukrainian-English and Ukrainian-German publicistic movies and series. A first assessment of a new audio and video language corpus, developed at the Admiral Makarov National University of Shipbuilding, used an automatic subtitling mechanism to improve linguistics students' recognition and understanding of oral speech. The skills and abilities that improved during work with the educational movie corpus have been identified.

    Correlating ASR Errors with Developmental Changes in Speech Production: A Study of 3-10-Year-Old European Portuguese Children's Speech

    Get PDF
    Automatically recognising children's speech is a very difficult task. This difficulty can be attributed to the high variability in children's speech, both within and across speakers. The variability is due to developmental changes in children's anatomy, speech production skills, and so on, and manifests itself, for example, in fundamental and formant frequencies, the frequency of disfluencies, and pronunciation quality. In this paper, we report the results of acoustic and auditory analyses of 3-10-year-old European Portuguese children's speech. Furthermore, we are able to correlate some of the pronunciation error patterns revealed by our analyses, such as the truncation of consonant clusters, with the errors made by a children's speech recogniser trained on speech collected from the same age group. Other pronunciation error patterns seem to have little or no impact on speech recognition performance. In future work, we will attempt to use our findings to improve the performance of our recogniser.

    A Developmental Study of ASR-Enhanced E-Book Software to Improve on-Task Interaction for First Grade Users

    Get PDF
    This developmental research study explores the effects of audio-visual feedback and user input mechanisms on user behaviors and satisfaction, through development of a first-grade reading program for the computer. Specific design elements investigated include human vs. synthesized audio feedback, segmented vs. whole-word pronunciation, format of supporting graphic (image vs. animation), use of automatic speech recognition (ASR) to encourage or enforce oral reading of an e-book, and effect of tutorial with mouse-click word identification or ASR-controlled word synthesis games. The study examines a variety of quantitative and qualitative measures including use logs, recorded screen-capture videos of use sessions, one-on-one interviews, and satisfaction surveys. The results of testing for each design element are analyzed, and the most appropriate design choice is implemented for subsequent design phases in an iterative manner. Design guidelines are given that confirm the existing literature's findings of user preference for human speech over computer-synthesized speech (TTS), and that supporting graphics can increase user satisfaction with e-books but also have the potential for distraction and reduction of active reading tasks. ASR was found to be ineffective as an input mechanism in this study due to user error and a low success rate, but was found to be better suited as a tool for smaller discrete tasks such as word synthesis practice and games, and may be effective for practice or support when offered as an optional tool to be used voluntarily by users. (Keywords: literacy, reading, e-books, CAI, educational technology, elementary education, animation, synthesized speech, TTS, ASR, developmental research)

    Producing Acoustic-Prosodic Entrainment in a Robotic Learning Companion to Build Learner Rapport

    Get PDF
    With advances in automatic speech recognition, spoken dialogue systems are assuming increasingly social roles. There is a growing need for these systems to be socially responsive, capable of building rapport with users. In human-human interactions, rapport is critical to patient-doctor communication, conflict resolution, educational interactions, and social engagement. Rapport between people promotes successful collaboration, motivation, and task success. Dialogue systems which can build rapport with their user may produce similar effects, personalizing interactions to create better outcomes. This dissertation focuses on how dialogue systems can build rapport utilizing acoustic-prosodic entrainment. Acoustic-prosodic entrainment occurs when individuals adapt their acoustic-prosodic features of speech, such as tone of voice or loudness, to one another over the course of a conversation. Correlated with liking and task success, a dialogue system which entrains may enhance rapport. Entrainment, however, is very challenging to model. People entrain on different features in many ways and how to design entrainment to build rapport is unclear. The first goal of this dissertation is to explore how acoustic-prosodic entrainment can be modeled to build rapport. Towards this goal, this work presents a series of studies comparing, evaluating, and iterating on the design of entrainment, motivated and informed by human-human dialogue. These models of entrainment are implemented in the dialogue system of a robotic learning companion. Learning companions are educational agents that engage students socially to increase motivation and facilitate learning. As a learning companion’s ability to be socially responsive increases, so do vital learning outcomes. A second goal of this dissertation is to explore the effects of entrainment on concrete outcomes such as learning in interactions with robotic learning companions.
This dissertation results in contributions both technical and theoretical. Technical contributions include a robust and modular dialogue system capable of producing prosodic entrainment and other socially-responsive behavior. One of the first systems of its kind, the results demonstrate that an entraining, social learning companion can positively build rapport and increase learning. This dissertation provides support for exploring phenomena like entrainment to enhance factors such as rapport and learning and provides a platform with which to explore these phenomena in future work. Dissertation/Thesis. Doctoral Dissertation, Computer Science, 201
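Entrainment of the kind described above is often quantified by asking whether the gap between the partners' feature values (for example, mean pitch per conversational turn) shrinks as the conversation proceeds. The sketch below is a hypothetical proximity-style convergence measure for illustration only; it is not the dissertation's model, and the function name and turn-level feature inputs are assumptions:

```python
def entrainment_slope(speaker_a, speaker_b):
    """Fit a least-squares line to the per-turn absolute feature difference.

    speaker_a, speaker_b: per-turn feature values (e.g. mean pitch in Hz),
    aligned by turn; at least two turns are assumed. A negative slope
    suggests convergence (entrainment), a positive slope divergence.
    """
    diffs = [abs(a - b) for a, b in zip(speaker_a, speaker_b)]
    n = len(diffs)
    x_mean = (n - 1) / 2          # turn indices are 0..n-1
    y_mean = sum(diffs) / n
    numerator = sum((i - x_mean) * (d - y_mean) for i, d in enumerate(diffs))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator
```

With pitch tracks that drift toward each other turn by turn, the per-turn difference falls and the slope is negative; with diverging tracks it is positive.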