Automated Detection of Non-Relevant Posts on the Russian Imageboard "2ch": Importance of the Choice of Word Representations
This study considers the problem of automated detection of non-relevant posts
on Web forums and discusses an approach that resolves it by approximating it
with the task of detecting semantic relatedness between a given post and the
opening post of the forum discussion thread. The approximated task can be
solved by training a supervised classifier on composed word embeddings of the
two posts. Since success on this task can be quite sensitive to the choice of
word representations, we compare the performance of different word embedding
models. We
train 7 models (Word2Vec, Glove, Word2Vec-f, Wang2Vec, AdaGram, FastText,
Swivel), evaluate the embeddings they produce on a dataset of human judgements,
and compare their performance on the task of non-relevant post detection. For
this comparison, we propose a dataset of semantic relatedness built from posts
on one of the most popular Russian Web forums, the imageboard "2ch", which has
challenging lexical and grammatical features.
Comment: 6 pages, 1 figure, 1 table, main proceedings of AIST-2017 (Analysis
of Images, Social Networks, and Texts)
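The composition step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tiny 2-dimensional embedding table, the token lists, and the cosine-similarity baseline are all hypothetical stand-ins for a real pre-trained model and a trained classifier.

```python
import numpy as np

# Toy pre-trained embeddings (hypothetical; the paper trains Word2Vec,
# GloVe, FastText, etc. on real corpora).
EMB = {
    "cats":   np.array([1.0, 0.0]),
    "purr":   np.array([0.9, 0.1]),
    "stocks": np.array([0.0, 1.0]),
    "fell":   np.array([0.1, 0.9]),
}

def post_vector(tokens):
    """Represent a post as the average of its known word vectors."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def compose(opening_post, post):
    """Concatenate the two averaged post vectors into a single feature
    vector on which a supervised relevance classifier could be trained."""
    return np.concatenate([post_vector(opening_post), post_vector(post)])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A simple unsupervised baseline: a relevant reply should be closer to the
# opening post than an off-topic one.
on_topic  = cosine(post_vector(["cats"]), post_vector(["purr"]))
off_topic = cosine(post_vector(["cats"]), post_vector(["fell"]))
```

With real embeddings, `compose` would feed a classifier trained on labeled (opening post, reply) pairs; the cosine baseline only illustrates why semantic relatedness is a reasonable proxy for relevance.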
Russian word sense induction by clustering averaged word embeddings
The paper reports our participation in the shared task on word sense
induction and disambiguation for the Russian language (RUSSE-2018). Our team
was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th
for the bts-rnc and active-dict datasets (containing mostly polysemous words)
among all 19 participants.
The method we employed was extremely naive: contexts of ambiguous words were
represented as averaged word embedding vectors, using off-the-shelf
pre-trained distributional models. These vector representations were then
clustered with mainstream clustering techniques, producing groups
corresponding to the senses of the ambiguous word. As a side result, we show
that word embedding models trained on small but balanced corpora can be
superior to those trained on large but noisy data, not only in intrinsic
evaluation but also in downstream tasks like word sense induction.
Comment: Proceedings of the 24th International Conference on Computational
Linguistics and Intellectual Technologies (Dialogue-2018)
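The averaging-then-clustering pipeline can be sketched in a few lines. This is a hypothetical toy, assuming 2-dimensional embeddings and a hand-rolled k-means; in practice one would use real pre-trained vectors and a library clusterer such as scikit-learn's KMeans.

```python
import numpy as np

# Hypothetical pre-trained embeddings; contexts of the ambiguous word
# "bank" contain either river-related or finance-related neighbors.
EMB = {
    "river": np.array([1.0, 0.0]),
    "water": np.array([0.9, 0.1]),
    "money": np.array([0.0, 1.0]),
    "loan":  np.array([0.1, 0.9]),
}

def context_vector(tokens):
    """Represent a context as the average of its word vectors."""
    return np.mean([EMB[t] for t in tokens if t in EMB], axis=0)

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means (a mainstream library implementation would do)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Four contexts of "bank": two river-sense, two finance-sense.
contexts = [["river", "water"], ["water"], ["money", "loan"], ["loan"]]
X = np.stack([context_vector(c) for c in contexts])
labels = kmeans(X, k=2)  # each cluster corresponds to one induced sense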
Combining Thesaurus Knowledge and Probabilistic Topic Models
In this paper we present an approach for introducing thesaurus knowledge into
probabilistic topic models. The main idea is based on the assumption that the
frequencies of semantically related words and phrases that occur in the same
texts should be increased: this leads to their larger contribution to the
topics found in those texts. We conducted experiments with several thesauri
and found that, for improving topic models, it is useful to utilize
domain-specific knowledge. If a general thesaurus, such as WordNet, is used,
the thesaurus-based improvement of topic models can be achieved by excluding
hyponymy relations in combined topic models.
Comment: Accepted to AIST-2017 conference (http://aistconf.ru/). The final
publication will be available at link.springer.co
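The frequency-boosting idea can be sketched as a preprocessing step applied before fitting a topic model. The mini-thesaurus, the document, and the boost factor below are all hypothetical; a real setup would use WordNet or a domain thesaurus and feed the boosted counts into an LDA-style model.

```python
from collections import Counter

# Hypothetical mini-thesaurus: groups of semantically related words.
THESAURUS = [{"car", "engine"}, {"dog", "puppy"}]

def boosted_counts(tokens, boost=2.0):
    """Increase the count of a word when another word from the same
    thesaurus group occurs in the same text, so that related words make
    a larger contribution to the topics inferred for that text."""
    counts = Counter(tokens)
    present = set(counts)
    boosted = {}
    for w, c in counts.items():
        # A word is boosted if some group contains it together with
        # at least one *other* word present in this text.
        related = any(w in g and (g & present) - {w} for g in THESAURUS)
        boosted[w] = c * boost if related else float(c)
    return boosted

doc = ["car", "engine", "dog", "road"]
bc = boosted_counts(doc)  # "car"/"engine" boosted; "dog"/"road" unchanged
```

The boosted counts replace raw term frequencies in the topic model's input, which is what gives co-occurring related words their enlarged contribution to the discovered topics.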