The acoustic environment can degrade speech quality during communication
(e.g., video calls, remote presentations, outdoor voice recordings), and its
impact is often unknown. Objective metrics for speech quality have proven
challenging to develop given the multi-dimensionality of factors that affect
speech quality and the difficulty of collecting labeled data. Hypothesizing
that room acoustics strongly influence speech quality, this paper presents MOSRA: a
non-intrusive multi-dimensional speech quality metric that can predict room
acoustics parameters (SNR, STI, T60, DRR, and C50) alongside the overall mean
opinion score (MOS) for speech quality. By explicitly optimizing the model to
learn these room acoustics parameters, we can extract more informative features
and improve the generalization for the MOS task when the training data is
limited. Furthermore, we show that this joint training method enhances the
blind estimation of room acoustics, improving the performance of current
state-of-the-art models. A further benefit of this joint prediction is
improved explainability of the predictions, a valuable property for many
applications.

Comment: Submitted to Interspeech 202