In this work, we study the features extracted by English self-supervised
learning (SSL) models in cross-lingual contexts and propose a new metric to
predict the quality of feature representations. Using automatic speech
recognition (ASR) as a downstream task, we analyze the effect of model size,
training objectives, and model architecture on the models' performance as
feature extractors for a set of typologically diverse corpora. We develop a
novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and
syntactic information in the extracted representations using deep generalized
canonical correlation analysis. Results show that the contrastive loss in the
wav2vec2.0 objective facilitates more effective cross-lingual feature
extraction. There is a positive correlation between PSR scores and ASR
performance, suggesting that phonetic information extracted by monolingual SSL
models can be used for downstream tasks in cross-lingual settings. The proposed
metric is an effective indicator of the quality of the representations and can
be useful for model selection.

Comment: 12 pages, 5 figures, 4 tables