In Speech Emotion Recognition (SER), textual data is often used alongside
audio signals to address their inherent variability. However, the reliance on
human-annotated text in most research hinders the development of practical SER
systems. To overcome this challenge, we investigate how Automatic Speech
Recognition (ASR) performs on emotional speech: we measure ASR performance on
emotion corpora and examine the distribution of word errors and confidence
scores in ASR transcripts to gain insight into how emotion affects ASR. We
utilize four ASR systems, namely Kaldi ASR, wav2vec2, Conformer, and Whisper,
and three corpora, IEMOCAP, MOSI, and MELD, to ensure generalizability.
Additionally, we conduct text-based SER on ASR transcripts with increasing word
error rates to investigate how ASR affects SER. The objective of this study is
to uncover the relationship and mutual impact of ASR and SER, in order to
facilitate ASR adaptation to emotional speech and the use of SER in the real world.

Comment: Accepted to INTERSPEECH 202