Most Transformer language models are primarily pretrained on English text,
limiting their use for other languages. As model sizes grow, the
performance gap between English and languages with less compute and data
available widens even further. Consequently, more resource-efficient
training methods are needed to bridge the gap for languages with fewer
resources available. To address this problem, we introduce a cross-lingual and
progressive transfer learning approach, called CLP-Transfer, that transfers
models from a source language, for which pretrained models are publicly
available, such as English, to a new target language. In contrast to prior work,
which focused on cross-lingual transfer between two languages, we additionally
transfer across model sizes. Given a pretrained model in a source language,
we aim for a same-sized model in a target language. Instead of training a model
from scratch, we exploit a smaller model in the target language, which
requires far fewer resources to train. Both the small and the source models are then used to
initialize the token embeddings of the larger model based on the overlapping
vocabulary of the source and target language. All remaining weights are reused
from the model in the source language. This approach outperforms
cross-lingual transfer alone and can save up to 80% of the training steps
compared to random initialization.
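To make the initialization step concrete, the following is an illustrative sketch, not the authors' released code: it assumes NumPy embedding matrices for the large source-language model and the small target-language model, dict-based vocabularies, and a cosine-similarity weighting for tokens outside the vocabulary overlap, which is one plausible realization of the idea described above; all function and variable names are hypothetical.

```python
import numpy as np

def clp_init_embeddings(src_emb, src_vocab, small_tgt_emb, tgt_vocab):
    """Sketch of a CLP-style token-embedding initialization.

    src_emb:       (|V_src|, d) embeddings of the large source-language model
    src_vocab:     dict mapping token -> row index in src_emb
    small_tgt_emb: (|V_tgt|, d_small) embeddings of the small target-language model
    tgt_vocab:     dict mapping token -> row index in the target vocabulary
    Returns a (|V_tgt|, d) matrix; all other (non-embedding) weights of the
    large target model would simply be copied from the source model.
    """
    d = src_emb.shape[1]
    tgt_emb = np.zeros((len(tgt_vocab), d), dtype=src_emb.dtype)

    # Tokens shared by both vocabularies keep their source-model embeddings.
    overlap = [t for t in tgt_vocab if t in src_vocab]
    overlap_tgt_ids = np.array([tgt_vocab[t] for t in overlap])
    overlap_src_ids = np.array([src_vocab[t] for t in overlap])
    tgt_emb[overlap_tgt_ids] = src_emb[overlap_src_ids]

    # Non-overlapping tokens: combine the source embeddings of the overlapping
    # tokens, weighted by similarity measured in the small target model's
    # embedding space (assumption: cosine similarity, clipped at zero and
    # normalized to sum to one).
    small_norm = small_tgt_emb / np.linalg.norm(small_tgt_emb, axis=1, keepdims=True)
    overlap_small = small_norm[overlap_tgt_ids]            # (|overlap|, d_small)
    for token, i in tgt_vocab.items():
        if token in src_vocab:
            continue
        sims = overlap_small @ small_norm[i]                # (|overlap|,)
        weights = np.maximum(sims, 0.0)
        weights /= weights.sum() + 1e-12
        tgt_emb[i] = weights @ src_emb[overlap_src_ids]     # weighted average

    return tgt_emb
```

Under this reading, the small target-language model only supplies similarity information between target tokens, so it can be orders of magnitude cheaper to train than the large model being initialized.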