A spinning wheel for YARN : user interface for a crowdsourced thesaurus
The YARN (Yet Another RussNet) project, started in 2013, aims at creating a large open thesaurus for Russian using crowdsourcing. This paper describes the synset assembly interface developed within the project: the motivation behind it, its design, usage scenarios, implementation details, and first experimental results.
The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer
Large multilingual language models such as mBERT or XLM-R enable zero-shot
cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed
a data- and compute-efficient method for cross-lingual adjustment of mBERT that
uses a small parallel corpus to make embeddings of related words across
languages similar to each other. They showed it to be effective in NLI for five
European languages. In contrast, we experiment with a typologically diverse set
of languages (Spanish, Russian, Vietnamese, and Hindi) and extend their
original implementations to new tasks (XSR, NER, and QA) and an additional
training regime (continual learning). Our study reproduced gains in NLI for
four languages and showed improved NER, XSR, and cross-lingual QA results in
three languages (though some cross-lingual QA gains were not statistically
significant), while monolingual QA performance never improved and sometimes
degraded. Analysis of distances between contextualized embeddings of related
and unrelated words (across languages) showed that fine-tuning leads to
"forgetting" some of the cross-lingual alignment information. Based on this
observation, we further improved NLI performance using continual learning.
Comment: Presented at ECIR 202
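For illustration, here is a minimal sketch of the kind of cross-lingual adjustment described above, in the spirit of Cao et al. (2020): contextual embeddings of word pairs aligned in a small parallel corpus are pulled together, while a regularization term keeps the adjusted encoder close to the frozen pretrained mBERT. The model name, example sentences, character spans, and weights are illustrative assumptions, not the authors' actual setup.

```python
# Sketch of Cao et al. (2020)-style cross-lingual adjustment (illustrative only).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"              # assumed encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)            # this copy gets adjusted
anchor = AutoModel.from_pretrained(MODEL_NAME).eval()    # frozen reference copy
for p in anchor.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def word_vector(encoder, sentence, span):
    """Mean of the last-layer vectors of the subtokens covering a character span."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    hidden = encoder(**enc).last_hidden_state[0]
    inside = ((offsets[:, 0] >= span[0]) & (offsets[:, 1] <= span[1])
              & (offsets[:, 1] > offsets[:, 0]))          # drop special tokens
    return hidden[inside].mean(dim=0)

def adjustment_loss(src_sent, src_span, tgt_sent, tgt_span, reg_weight=1.0):
    e_src = word_vector(model, src_sent, src_span)
    e_tgt = word_vector(model, tgt_sent, tgt_span)
    align = (e_src - e_tgt).pow(2).sum()                  # pull aligned words together
    with torch.no_grad():                                 # stay close to pretrained mBERT
        e_ref = word_vector(anchor, src_sent, src_span)
    return align + reg_weight * (e_src - e_ref).pow(2).sum()

# One adjustment step on a single hypothetical aligned pair: "cat" <-> "Кошка".
loss = adjustment_loss("The cat sleeps.", (4, 7), "Кошка спит на диване.", (0, 5))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```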
KazQAD: Kazakh Open-Domain Question Answering Dataset
We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset
-- that can be used in both reading comprehension and full ODQA settings, as
well as for information retrieval experiments. KazQAD contains just under 6,000
unique questions with extracted short answers and nearly 12,000 passage-level
relevance judgements. We use a combination of machine translation, Wikipedia
search, and in-house manual annotation to ensure annotation efficiency and data
quality. The questions come from two sources: translated items from the Natural
Questions (NQ) dataset (only for training) and the original Kazakh Unified
National Testing (UNT) exam (for development and testing). The accompanying
text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a
supplementary dataset, we release around 61,000 question-passage-answer triples
from the NQ dataset that have been machine-translated into Kazakh. We develop
baseline retrievers and readers that achieve reasonable scores in retrieval
(NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and
full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are
substantially lower than state-of-the-art results for English QA collections,
and we believe there is still ample room for improvement. We also show that
OpenAI's current ChatGPT v3.5 cannot answer KazQAD test questions in the
closed-book setting with acceptable quality. The dataset is freely available
under a Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.
Comment: To appear in Proceedings of the 2024 Joint International Conference
on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
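For reference, the reading-comprehension numbers above are Exact Match (EM) and token-level F1. The following minimal sketch shows how such SQuAD-style metrics are typically computed; the normalization is simplified and is not the KazQAD evaluation script, and the example prediction/gold pair is hypothetical.

```python
# Simplified SQuAD-style EM and token-level F1 for extractive answers.
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase, strip punctuation, split on whitespace.
    return "".join(ch if ch.isalnum() or ch.isspace() else " "
                   for ch in text.lower()).split()

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction/gold pair:
print(exact_match("Abai Qunanbaiuly", "Abai Qunanbaiuly"))        # 1.0
print(round(token_f1("the poet Abai", "Abai Qunanbaiuly"), 2))    # 0.4
```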
YARN : spinning-in-progress
YARN (Yet Another RussNet), a project started in 2013, aims at creating a large open WordNet-like thesaurus for Russian by means of crowdsourcing. The first stage of the project was to create noun synsets. Currently, the resource comprises 100K+ word entries and 46K+ synsets. More than 200 people have taken part in assembling synsets throughout the project. The paper describes the linguistic, technical, and organizational principles of the project, as well as the evaluation results, lessons learned, and future plans.
LEARNING TO PREDICT CLOSED QUESTIONS ON STACK OVERFLOW // Uchenye Zapiski Kazanskogo Universiteta. Fiziko-Matematicheskie Nauki, 2013, Vol. 155, No. 4
The paper addresses the task of predicting the probability that a question posted on Stack Overflow, a popular question-answering site dedicated to software development, will be closed by a moderator. The task, the data, and the quality metric were proposed within an open machine learning competition hosted on Kaggle. To solve the task, we used a wide range of classification features, including features describing personal characteristics of the user, interactions between users, and the content of the questions, including topical features. Several machine learning algorithms were tested for classification. The experiments revealed the most important features: personal characteristics of the user and topical features of the question. The best results were obtained with the algorithm implemented in the Vowpal Wabbit library, online learning based on stochastic gradient descent. Our best score falls within the top 5 results on the final leaderboard, although it was obtained after the competition deadline.
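The paper's best classifier was Vowpal Wabbit's online SGD learner. The minimal sketch below uses scikit-learn's SGDClassifier as a stand-in for the same idea (a linear model with logistic loss updated by stochastic gradient descent); the feature names and values are invented for illustration and are not the paper's actual feature set.

```python
# Online logistic regression over question/user features (illustrative stand-in
# for the Vowpal Wabbit setup described above).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical feature dictionaries: user characteristics + question content.
questions = [
    {"user_reputation": 12,   "user_age_days": 3,   "title_len": 9,  "tag=java": 1},
    {"user_reputation": 4520, "user_age_days": 810, "title_len": 14, "tag=python": 1},
]
labels = [1, 0]  # 1 = closed by a moderator, 0 = stayed open

vec = DictVectorizer()
X = vec.fit_transform(questions)

clf = SGDClassifier(loss="log_loss")          # logistic loss, SGD updates
clf.partial_fit(X, labels, classes=[0, 1])    # one online update step

# Probability that a new (hypothetical) question will be closed:
x_new = vec.transform([{"user_reputation": 30, "title_len": 5, "tag=homework": 1}])
print(clf.predict_proba(x_new)[0, 1])
```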
NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities
This paper describes NEREL-BIO -- an annotation scheme and corpus of PubMed
abstracts in Russian and a smaller number of abstracts in English. NEREL-BIO
extends the general-domain dataset NEREL by introducing domain-specific entity
types. The NEREL-BIO annotation scheme covers both general and biomedical domains,
making it suitable for domain transfer experiments. NEREL-BIO provides
annotation for nested named entities as an extension of the scheme employed for
NEREL. Nested named entities may cross entity boundaries to connect to shorter
entities nested within longer entities, making them harder to detect.
NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts.
All English PubMed annotations have corresponding Russian counterparts. Thus,
NEREL-BIO has two distinctive features: it provides annotation of nested named
entities, and it can serve as a benchmark for cross-domain (NEREL -> NEREL-BIO)
and cross-language (English -> Russian) transfer. We experiment with both
transformer-based sequence models and machine reading comprehension (MRC)
models and report their results.
The dataset is freely available at https://github.com/nerel-ds/NEREL-BIO.
Comment: Submitted to Bioinformatics (Publisher: Oxford University Press).
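As an illustration of the nested annotation described above, the following minimal sketch shows one way such annotations can be represented as overlapping character spans. The example sentence and labels are invented and only loosely modelled on biomedical entity types; they are not taken from the corpus.

```python
# Nested named-entity spans: a shorter entity contained inside a longer one.
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str

text = "Chronic kidney disease was diagnosed in the patient."
entities = [
    Entity(0, 22, "DISO"),     # "Chronic kidney disease" -- a disorder
    Entity(8, 14, "ANATOMY"),  # "kidney" -- nested inside the longer entity
]

def is_nested(inner: Entity, outer: Entity) -> bool:
    """True if `inner` lies inside `outer` without coinciding with it."""
    return (outer.start <= inner.start and inner.end <= outer.end
            and (inner.start, inner.end) != (outer.start, outer.end))

print(is_nested(entities[1], entities[0]))       # True
print(text[entities[1].start:entities[1].end])   # "kidney"
```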
Vector representation of words with semantic relations: experimental observations
The ability to identify semantic relations between words has made the word2vec model widely used in NLP tasks. The idea behind word2vec is a simple one: words that occur in similar contexts receive similar representations. Each word is represented as a vector, and words whose vectors lie close to each other can be interpreted as semantically similar. This makes it possible to extract semantic relations (synonymy, hypernymy/hyponymy, and others) automatically. Manual extraction of semantic relations is a time-consuming and subjective task that requires considerable effort and expert involvement. Unfortunately, the associative list of nearest neighbours produced by a word2vec model contains not only semantically related words. In this paper, we examine additional criteria that may help address this problem. Observations and experiments with well-known characteristics, such as word frequency and position in the associative list, can be useful for improving the extraction of semantic relations for Russian with word embeddings. In the experiments, a word2vec model trained on the Flibusta corpus is used, and pairs from Wiktionary serve as examples of semantic relations. Semantically related words are applicable to thesauri, ontologies, and intelligent systems for natural language processing.
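As an illustration of the filtering criteria discussed above, here is a minimal sketch using gensim: it takes the word2vec associative list (nearest neighbours) for a word and keeps only neighbours that appear early in the list and are frequent enough in the corpus. The model path, thresholds, and example word are assumptions, and the sketch assumes the saved vectors keep per-word corpus frequencies, as gensim's Word2Vec does.

```python
# Filter word2vec nearest neighbours by list position and corpus frequency.
from gensim.models import KeyedVectors

# Hypothetical pretrained Russian word2vec vectors (e.g., trained on Flibusta).
wv = KeyedVectors.load("flibusta_word2vec.kv")

def related_candidates(word, topn=20, max_rank=10, min_count=50):
    """Keep neighbours that appear early in the associative list and are frequent."""
    candidates = []
    for rank, (neighbour, similarity) in enumerate(wv.most_similar(word, topn=topn)):
        freq = wv.get_vecattr(neighbour, "count")  # corpus frequency stored by gensim
        if rank < max_rank and freq >= min_count:
            candidates.append((neighbour, similarity, freq))
    return candidates

# Candidate related words for a sample word, to be checked against Wiktionary pairs:
for neighbour, sim, freq in related_candidates("собака"):
    print(neighbour, round(sim, 3), freq)
```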