In this paper, we present a dataset for the computational study of a number
of Modern Greek dialects. It consists of raw text data from four dialects of
Modern Greek: Cretan, Pontic, Northern Greek, and Cypriot Greek. The dataset is
of considerable size, albeit imbalanced, and represents the first attempt to
create large-scale dialectal resources of this type for Modern Greek dialects.
We then use the dataset to perform dialect identification. We experiment with
traditional ML algorithms, as well as simple DL architectures. The results show
very good performance on the task, suggesting that the dialects in question
have characteristics distinct enough for even simple ML models to perform
well. Error analysis of the top-performing algorithms shows that in a number
of cases the errors are due to insufficient dataset cleaning.