Domain-Specific Customization for Improving Speech to Text

Tobias, Zubin Mario

oai:repository.rit.edu:theses-13214

Authors: Zubin Mario Tobias
Publication date: 1 January 2025
Publisher: RIT Digital Institutional Repository

Abstract

The advent of transformer-based models has revolutionized natural language processing, bringing remarkable improvements in tasks like automatic speech recognition (ASR). Inspired by these advancements, this thesis explores the optimization of a transformer-based ASR model to improve transcription accuracy in educational settings, particularly for lecture content. The goal of this research is to provide real-time, high-accuracy captions that enhance accessibility for all students, while offering a cost-effective solution for educators. To assess the potential of domain-specific fine-tuning, Whisper-small underwent two phases of fine-tuning. In the first phase, it was fine-tuned on carefully selected, publicly available datasets: SpeechColab’s Gigaspeech-XS [39] and the AMI Meeting Corpus [14]. In the second phase, the fine-tuned model was further optimized on a self-curated dataset [16] consisting of roughly 10 hours of live lecture recordings collected and assembled by me. Finally, a real-time captioning assistant application was developed to leverage the fine-tuned model and transcribe speech in real time with live editing capabilities. The optimized Whisper-small model was evaluated against Whisper’s pretrained small, medium, and large (version 2) counterparts. The evaluation was performed on a clean, unseen dataset [15] prepared by me. The fine-tuned model achieved a lower Word Error Rate (WER) of 4.53%, compared to 5.51% and 5.78% for Whisper-Medium and Whisper-Large-V2, respectively. These results demonstrate that fine-tuning a transformer-based ASR model on domain-specific data can significantly enhance its performance in a targeted context, such as live lecture transcription. The findings of this experiment highlight the promise of transformer-based models for improving educational accessibility. By building an application tailored to live lecture settings, this research also contributes to the development of adaptable, low-cost technologies that support inclusive learning environments. The success of this experiment lays the groundwork for future breakthroughs in speech recognition, aiming to make education more accessible for everyone.
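
The abstract compares models by Word Error Rate. As an illustration only, the sketch below shows how WER is commonly computed: the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The function names, the lowercase-and-split normalization, and the example sentences are assumptions made for this sketch; the abstract does not say which tooling or text normalization the thesis used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized by lowercasing and whitespace splitting.
    The thesis may have used a different normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical lecture snippet: one dropped word, one substituted word.
    ref = "the gradient of the loss function"
    hyp = "the gradient of loss functions"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")

    # Relative improvement implied by the figures reported in the abstract.
    print(f"vs Whisper-Medium:   {relative_reduction(5.51, 4.53):.1%}")
    print(f"vs Whisper-Large-V2: {relative_reduction(5.78, 4.53):.1%}")
```

Applied to the reported figures, the same arithmetic puts the fine-tuned model's 4.53% WER at roughly 18% below Whisper-Medium and roughly 22% below Whisper-Large-V2 in relative terms.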

Last time updated on 14/06/2025

This paper was published in RIT Digital Institutional Repository.
