Speech data carries rich acoustic and paralinguistic information, including important
cues for understanding a speaker's tone, emotion, and intent, yet traditional
large language models such as BERT do not incorporate this information. There
has been an increased interest in multi-modal language models leveraging audio
and/or visual information and text. However, current multi-modal language
models require both the text and the audio/visual data streams at inference
(test) time. In this work, we propose a methodology for training language models
that leverages spoken-language audio data during training but does not require the
audio stream at prediction time. This yields an improved language model for analyzing
spoken transcripts while avoiding audio-processing overhead at test time. We
achieve this via an audio-language knowledge distillation framework, where we
transfer acoustic and paralinguistic information from a pre-trained speech
embedding (OpenAI Whisper) teacher model to help train a student language model
on an audio-text dataset. In our experiments, the student model achieves
consistent improvement over traditional language models on tasks analyzing
spoken transcripts.
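
The sketch below illustrates the general shape of such an audio-to-text knowledge distillation setup: a frozen Whisper encoder serves as the teacher, and a BERT student is trained so that its pooled transcript representation stays close to the teacher's audio embedding while also solving the downstream task. This is not the authors' released code; the specific model checkpoints, the cosine distillation loss, the mean-pooled audio embedding, and the loss weight `alpha` are illustrative assumptions, and the log-mel feature extraction for Whisper is omitted.

```python
# Minimal sketch of audio-language knowledge distillation (assumed details,
# not the paper's exact implementation).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, WhisperModel

teacher = WhisperModel.from_pretrained("openai/whisper-base").encoder.eval()  # frozen speech teacher
student = BertModel.from_pretrained("bert-base-uncased")                      # text-only student
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Project the student's pooled output into the teacher's embedding space.
proj = nn.Linear(student.config.hidden_size, teacher.config.d_model)
classifier = nn.Linear(student.config.hidden_size, 2)  # e.g. a binary label on the transcript

alpha = 0.5  # assumed weighting between task loss and distillation loss
ce = nn.CrossEntropyLoss()

def training_step(input_features, transcript, label):
    """input_features: log-mel features for one utterance, shape (1, 80, 3000);
    transcript: the corresponding text; label: downstream task label, shape (1,)."""
    with torch.no_grad():  # the teacher is not updated
        # Mean-pool the Whisper encoder states into a single utterance embedding.
        audio_emb = teacher(input_features).last_hidden_state.mean(dim=1)

    toks = tokenizer(transcript, return_tensors="pt", truncation=True)
    pooled = student(**toks).pooler_output

    task_loss = ce(classifier(pooled), label)
    # Distillation loss: pull the projected text embedding toward the audio embedding.
    distill_loss = 1.0 - nn.functional.cosine_similarity(proj(pooled), audio_emb).mean()
    return task_loss + alpha * distill_loss
```

Under this setup, only the student and the classifier head are used at test time, so prediction on spoken transcripts needs no audio stream and no Whisper forward pass, consistent with the goal stated above.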