In this paper, we employ Singular Value Canonical Correlation Analysis
(SVCCA) to analyze the representations learned by a multilingual end-to-end
speech translation model trained on 22 languages. SVCCA enables us to estimate
representational similarity across languages and layers, enhancing our
understanding of the functionality of multilingual speech translation and its
potential connection to multilingual neural machine translation. The
multilingual speech translation model is trained on the CoVoST 2 dataset in all
possible directions, and we utilize LASER to extract parallel bitext data for
SVCCA analysis. We derive three major findings from our analysis: (I)
Linguistic similarity loses its efficacy in multilingual speech translation
when the training data for a specific language is limited. (II) Enhanced
encoder representations and well-aligned audio-text data significantly improve
translation quality, surpassing their bilingual counterparts when the training
data is not compromised. (III) The encoder representations of the multilingual
speech translation model excel at predicting phonetic features in linguistic
typology prediction tasks. With these findings, we propose
that relaxing the constraint of limited data for low-resource languages and
subsequently combining them with linguistically related high-resource languages
could offer a more effective approach to multilingual end-to-end speech
translation.

Comment: Accepted to Findings of EMNLP 202
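As a rough illustration of the analysis method: SVCCA first reduces each matrix of layer activations with an SVD that retains most of the variance, then applies CCA to the reduced representations and averages the canonical correlations as a similarity score. The sketch below is a minimal NumPy version under our own assumptions (function names, the 99% variance threshold, and matrix shapes are illustrative, not the authors' implementation):

```python
import numpy as np

def svcca(X, Y, var_threshold=0.99):
    """SVCCA similarity between two activation matrices.

    X, Y: (n_samples, n_neurons) activations, e.g. encoder states for
    parallel utterances in two languages (shapes are an assumption here).
    """
    def svd_reduce(A, thresh):
        # Center, then keep the top singular directions covering
        # `thresh` of the variance.
        A = A - A.mean(axis=0)
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        var = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = int(np.searchsorted(var, thresh)) + 1
        return U[:, :k] * s[:k]

    Xr = svd_reduce(X, var_threshold)
    Yr = svd_reduce(Y, var_threshold)

    # CCA via QR + SVD: singular values of Qx^T Qy are the
    # canonical correlations between the two subspaces.
    Qx, _ = np.linalg.qr(Xr - Xr.mean(axis=0))
    Qy, _ = np.linalg.qr(Yr - Yr.mean(axis=0))
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(corrs.mean())
```

Comparing a representation with itself yields a similarity of 1.0, and the score is invariant to invertible linear transforms of either input, which is what makes it usable for comparing layers and languages of different models.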