The use of multilingual language models for tasks in low- and high-resource
languages has been a success story in deep learning. In recent times, Arabic
has been receiving widespread attention on account of its dialectal variance.
While prior studies have tried to adapt these multilingual models to
dialectal variants of Arabic, doing so remains challenging owing to
the lack of sufficient monolingual dialectal data and parallel translation data
for such variants. Whether the limited dialectal data can be used to improve
models trained on Arabic when they are applied to its dialectal variants
remains an open problem. First, we show that multilingual-BERT (mBERT)
incrementally pretrained on Arabic monolingual data takes less training time
and yields comparable accuracy when compared to our custom monolingual Arabic
model, and beats existing models (an average metric gain of +6.41). We then
explore two
continual pre-training methods: (1) continual fine-tuning on small amounts of
dialectal data and (2) training on parallel Arabic-to-English data with a
Translation Language Modeling loss function. We show that both approaches help
improve performance on dialectal classification tasks (+4.64 avg. gain) when
used on monolingual models.
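
To make the second method concrete, the following is a minimal sketch of how a
single Translation Language Modeling (TLM) training example could be built from
one parallel sentence pair. It is an illustration under stated assumptions, not
the implementation used in this work: the function name `build_tlm_example`,
the whitespace tokenization, the BERT-style special tokens, the 0.15 masking
rate, and the choice to always substitute `[MASK]` are all assumptions. The
idea it shows is that the Arabic and English sentences are concatenated into
one sequence and tokens on both sides are masked, so the model can use context
from either language to recover a masked token.

```python
import random

# Assumed BERT-style special tokens; the actual vocabulary may differ.
CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"


def build_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, rng=None):
    """Build one TLM example from a parallel (source, target) token pair.

    Returns (inputs, segment_ids, labels). labels[i] holds the original
    token where inputs[i] was masked, and None everywhere else.
    """
    rng = rng or random.Random()
    src, tgt = list(src_tokens), list(tgt_tokens)

    # Concatenate both sides of the pair into a single sequence.
    tokens = [CLS] + src + [SEP] + tgt + [SEP]

    # Segment 0 covers [CLS] + source + first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(src) + 2) + [1] * (len(tgt) + 1)

    # Positions of special tokens, which are never masked.
    special = {0, len(src) + 1, len(tokens) - 1}

    inputs, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in special and rng.random() < mask_prob:
            # Simplification (assumption): always substitute [MASK].
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, segment_ids, labels


if __name__ == "__main__":
    arabic = "أنا أحب القراءة".split()
    english = "I love reading".split()
    inputs, segments, labels = build_tlm_example(
        arabic, english, rng=random.Random(0)
    )
    print(inputs)
    print(segments)
    print(labels)
```

In a real pipeline the tokens would come from the model's subword tokenizer,
and the masked-token loss would be computed only at positions where `labels`
is not `None`.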