In this paper, we present a distributional word embedding model trained on
one of the largest available Russian corpora: Araneum Russicum Maximum (over 10
billion words crawled from the web). We compare this model to the model trained
on the Russian National Corpus (RNC). The two corpora differ substantially in
size and compilation procedure. We examine the effects of these differences by evaluating
the trained models against the Russian part of the Multilingual SimLex999
semantic similarity dataset. We detect and describe numerous issues in this
dataset and publish a new corrected version. Besides confirming the already
known fact that the RNC is generally a better training corpus than web corpora,
we enumerate and explain subtle differences in how the models handle the
semantic similarity task, and which parts of the evaluation set are difficult for particular
models and why. Additionally, the learning curves for both models are
described, showing that the RNC is generally more robust as training material
for this task.
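For reference, evaluation against a SimLex-style dataset is typically carried out by correlating the models' cosine similarities with the human similarity judgments using Spearman's rank correlation. The following is a minimal sketch of such an evaluation with gensim and SciPy; the file names and paths are hypothetical placeholders, not the actual resources used in this paper.

```python
# Minimal sketch: evaluating a word embedding model against a SimLex-style
# word pair dataset via Spearman correlation (file names are hypothetical).
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

# Load a pre-trained model in word2vec binary format (path is a placeholder).
model = KeyedVectors.load_word2vec_format("araneum_model.bin", binary=True)

human_scores, model_scores = [], []
# Each line of the evaluation file: word1 <TAB> word2 <TAB> human similarity score.
with open("ru_simlex999.tsv", encoding="utf-8") as pairs:
    for line in pairs:
        word1, word2, score = line.strip().split("\t")
        if word1 in model and word2 in model:  # skip out-of-vocabulary pairs
            human_scores.append(float(score))
            model_scores.append(model.similarity(word1, word2))

# Spearman rank correlation between human judgments and cosine similarities.
rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g}, {len(human_scores)} pairs)")
```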