This article investigates the knowledge transfer from the RuQTopics dataset.
This Russian topical dataset combines a large number of samples (361,560
single-label, 170,930 multi-label) with extensive class coverage (76 classes).
We prepared this dataset from the "Yandex Que" raw data. By evaluating the
RuQTopics-trained models on the six matching classes of the Russian MASSIVE
subset, we show that the RuQTopics dataset is suitable for real-world
conversational tasks, as the Russian-only models trained on this dataset
consistently yield an accuracy of around 85\% on this subset. We also found
that for the multilingual BERT, trained on RuQTopics and evaluated on
the same six classes of MASSIVE (for all MASSIVE languages), the language-wise
accuracy closely correlates (Spearman correlation 0.773 with p-value 2.997e-11)
with the approximate size of BERT's pretraining data for the corresponding
language. At the same time, the correlation of the language-wise accuracy with
the linguistic distance from Russian is not statistically significant.
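
For readers unfamiliar with the statistic reported above, the following is a minimal pure-Python sketch of how a Spearman rank correlation between language-wise accuracy and pretraining data size can be computed. The function names and all numeric values are hypothetical placeholders chosen for illustration; they are not the data or the code behind the reported results, and the p-value is not computed here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder values, NOT results from the paper:
# one accuracy and one pretraining-size figure per language.
accuracy = [0.85, 0.80, 0.62, 0.71, 0.55]
pretraining_size = [12.0, 9.5, 1.1, 0.9, 0.4]

rho = spearman(accuracy, pretraining_size)
print(f"Spearman correlation: {rho:.3f}")
```

Because the statistic depends only on ranks, an approximate estimate of the pretraining data size per language is enough, as long as the ordering of languages is roughly right. In practice, a library routine such as `scipy.stats.spearmanr` returns both the coefficient and the p-value.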