32 research outputs found

    A spinning wheel for YARN : user interface for a crowdsourced thesaurus

    Full text link
    YARN (Yet Another RussNet) project started in 2013 aims at creating a large open thesaurus for Russian using crowdsourcing. This paper describes synset assembly interface developed within the project β€” motivation behind it, design, usage scenarios, implementation details, and first experimental results

    The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer

    Get PDF
    Large multilingual language models such as mBERT or XLM-R enable zero-shot cross-lingual transfer in various IR and NLP tasks. Cao et al. (2020) proposed a data- and compute-efficient method for cross-lingual adjustment of mBERT that uses a small parallel corpus to make embeddings of related words across languages similar to each other. They showed it to be effective in NLI for five European languages. In contrast we experiment with a typologically diverse set of languages (Spanish, Russian, Vietnamese, and Hindi) and extend their original implementations to new tasks (XSR, NER, and QA) and an additional training regime (continual learning). Our study reproduced gains in NLI for four languages, showed improved NER, XSR, and cross-lingual QA results in three languages (though some cross-lingual QA gains were not statistically significant), while mono-lingual QA performance never improved and sometimes degraded. Analysis of distances between contextualized embeddings of related and unrelated words (across languages) showed that fine-tuning leads to "forgetting" some of the cross-lingual alignment information. Based on this observation, we further improved NLI performance using continual learning.Comment: Presented at ECIR 202

    KazQAD: Kazakh Open-Domain Question Answering Dataset

    Full text link
    We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389 MRR = 0.382), reading comprehension (EM = 38.5 F1 = 54.2), and full ODQA (EM = 17.8 F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we think that there should still be ample room for improvement. We also show that the current OpenAI's ChatGPTv3.5 is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD.Comment: To appear in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024

    YARN : spinning-in-progress

    Full text link
    YARN (Yet Another RussNet), a project started in 2013, aims at creating a large open WordNet-like thesaurus for Russian by means of crowdsourcing. The first stage of the project was to create noun synsets. Currently, the resource comprises 100K+ word entries and 46K+ synsets. More than 200 people have taken part in assembling synsets throughout the project. The paper describes the linguistic, technical, and organizational principles of the project, as well as the evaluation results, lessons learned, and the future plans

    LEARNING TO PREDICT CLOSED QUESTIONS ON STACK OVERFLOW // Π£Ρ‡Π΅Π½Ρ‹Π΅ записки КЀУ. Π€ΠΈΠ·ΠΈΠΊΠΎ-матСматичСскиС Π½Π°ΡƒΠΊΠΈ 2013 Ρ‚ΠΎΠΌ155 N4

    Get PDF
    Π’ ΡΡ‚Π°Ρ‚ΡŒΠ΅ рассматриваСтся Π·Π°Π΄Π°Ρ‡Π° прогнозирования вСроятности Ρ‚ΠΎΠ³ΠΎ, Ρ‡Ρ‚ΠΎ вопрос Π½Π° сСрвисС Stack Overflow - популярном вопросно-ΠΎΡ‚Π²Π΅Ρ‚Π½ΠΎΠΌ рСсурсС, посвящСнном Ρ€Π°Π·Ρ€Π°Π±ΠΎΡ‚ΠΊΠ΅ ΠΏΡ€ΠΎΠ³Ρ€Π°ΠΌΠΌΠ½ΠΎΠ³ΠΎ обСспСчСния - Π±ΡƒΠ΄Π΅Ρ‚ Π·Π°ΠΊΡ€Ρ‹Ρ‚ ΠΌΠΎΠ΄Π΅Ρ€Π°Ρ‚ΠΎΡ€ΠΎΠΌ. Π—Π°Π΄Π°Ρ‡Π°, Π΄Π°Π½Π½Ρ‹Π΅ ΠΈ ΠΌΠ΅Ρ‚Ρ€ΠΈΠΊΠ° ΠΎΡ†Π΅Π½ΠΊΠΈ качСства Π±Ρ‹Π»ΠΈ ΠΏΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½Ρ‹ Π² Ρ€Π°ΠΌΠΊΠ°Ρ… ΠΎΡ‚ΠΊΡ€Ρ‹Ρ‚ΠΎΠ³ΠΎ конкурса ΠΏΠΎ ΠΌΠ°ΡˆΠΈΠ½Π½ΠΎΠΌΡƒ ΠΎΠ±ΡƒΡ‡Π΅Π½ΠΈΡŽ Π½Π° сСрвисС Kaggle. Π’ процСссС Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ Π·Π°Π΄Π°Ρ‡ΠΈ ΠΌΡ‹ использовали ΡˆΠΈΡ€ΠΎΠΊΠΈΠΉ Π½Π°Π±ΠΎΡ€ ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΎΠ² для классификации, Π² Ρ‚ΠΎΠΌ числС ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΈ, ΠΎΠΏΠΈΡΡ‹Π²Π°ΡŽΡ‰ΠΈΠ΅ Π»ΠΈΡ‡Π½Ρ‹Π΅ характСристики ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»Ρ, взаимодСйствиС ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»Π΅ΠΉ Π΄Ρ€ΡƒΠ³ с Π΄Ρ€ΡƒΠ³ΠΎΠΌ, Π° Ρ‚Π°ΠΊΠΆΠ΅ содСрТаниС вопросов, Π² Ρ‚ΠΎΠΌ числС тСматичСскоС. Π’ процСссС классификации протСстировано нСсколько Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠΎΠ² машинного обучСния. По Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚Π°ΠΌ экспСримСнта Π±Ρ‹Π»ΠΈ выявлСны Π½Π°ΠΈΠ±ΠΎΠ»Π΅Π΅ Π²Π°ΠΆΠ½Ρ‹Π΅ ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΈ: Π»ΠΈΡ‡Π½Ρ‹Π΅ характСристики ΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Ρ‚Π΅Π»Ρ ΠΈ тСматичСскиС ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΈ вопроса. ΠΠ°ΠΈΠ»ΡƒΡ‡ΡˆΠΈΠ΅ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚Ρ‹ Π±Ρ‹Π»ΠΈ ΠΏΠΎΠ»ΡƒΡ‡Π΅Π½Ρ‹ с ΠΏΠΎΠΌΠΎΡ‰ΡŒΡŽ Π°Π»Π³ΠΎΡ€ΠΈΡ‚ΠΌΠ°, Ρ€Π΅Π°Π»ΠΈΠ·ΠΎΠ²Π°Π½Π½ΠΎΠ³ΠΎ Π² Π±ΠΈΠ±Π»ΠΈΠΎΡ‚Π΅ΠΊΠ΅ Vowpal Wabbit, - ΠΈΠ½Ρ‚Π΅Ρ€Π°ΠΊΡ‚ΠΈΠ²Π½ΠΎΠ³ΠΎ обучСния Π½Π° основС стохастичСского Π³Ρ€Π°Π΄ΠΈΠ΅Π½Ρ‚Π½ΠΎΠ³ΠΎ спуска. ΠΠ°ΠΈΠ»ΡƒΡ‡ΡˆΠ°Ρ получСнная Π½Π°ΠΌΠΈ ΠΎΡ†Π΅Π½ΠΊΠ° ΠΏΠΎΠΏΠ°Π΄Π°Π΅Ρ‚ Π² Ρ‚ΠΎΠΏ-5 Π»ΡƒΡ‡ΡˆΠΈΡ… Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ΠΎΠ² Π² Ρ„ΠΈΠ½Π°Π»ΡŒΠ½ΠΎΠΉ Ρ‚Π°Π±Π»ΠΈΡ†Π΅, Π½ΠΎ ΠΏΠΎΠ»ΡƒΡ‡Π΅Π½Π° послС Π΄Π°Ρ‚Ρ‹ Π·Π°Π²Π΅Ρ€ΡˆΠ΅Π½ΠΈΡ конкурса

    NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities

    Full text link
    This paper describes NEREL-BIO -- an annotation scheme and corpus of PubMed abstracts in Russian and smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. NEREL-BIO annotation scheme covers both general and biomedical domains making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO comprises the following specific features: annotation of nested named entities, it can be used as a benchmark for cross-domain (NEREL -> NEREL-BIO) and cross-language (English -> Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension (MRC) models and report their results. The dataset is freely available at https://github.com/nerel-ds/NEREL-BIO.Comment: Submitted to Bioinformatics (Publisher: Oxford University Press

    Π’Π΅ΠΊΡ‚ΠΎΡ€Π½ΠΎΠ΅ прСдставлСниС слов с сСмантичСскими ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΡΠΌΠΈ: ΡΠΊΡΠΏΠ΅Ρ€ΠΈΠΌΠ΅Π½Ρ‚Π°Π»ΡŒΠ½Ρ‹Π΅ наблюдСния

    Get PDF
    The ability to identify semantic relations between words has made a word2vec model widely used in NLP tasks. The idea of word2vec is based on a simple rule that a higher similarity can be reached if two words have a similar context. Each word can be represented as a vector, so the closest coordinates of vectors can be interpreted as similar words. It allows to establish semantic relations (synonymy, relations of hypernymy and hyponymy and other semantic relations) by applying an automatic extraction. The extraction of semantic relations by hand is considered as a time-consuming and biased task, requiring a large amount of time and some help of experts. Unfortunately, the word2vec model provides an associative list of words which does not consist of relative words only. In this paper, we show some additional criteria that may be applicable to solve this problem. Observations and experiments with well-known characteristics, such as word frequency, a position in an associative list, might be useful for improving results for the task of extraction of semantic relations for the Russian language by using word embedding. In the experiments, the word2vec model trained on the Flibusta and pairs from Wiktionary are used as examples with semantic relationships. Semantically related words are applicable to thesauri, ontologies and intelligent systems for natural language processing.Π’ΠΎΠ·ΠΌΠΎΠΆΠ½ΠΎΡΡ‚ΡŒ ΠΈΠ΄Π΅Π½Ρ‚ΠΈΡ„ΠΈΠΊΠ°Ρ†ΠΈΠΈ сСмантичСской близости ΠΌΠ΅ΠΆΠ΄Ρƒ словами сдСлала модСль word2vec ΡˆΠΈΡ€ΠΎΠΊΠΎ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΡƒΠ΅ΠΌΠΎΠΉ Π² NLP-Π·Π°Π΄Π°Ρ‡Π°Ρ…. ИдСя word2vec основана Π½Π° контСкстной близости слов. КаТдоС слово ΠΌΠΎΠΆΠ΅Ρ‚ Π±Ρ‹Ρ‚ΡŒ прСдставлСно Π² Π²ΠΈΠ΄Π΅ Π²Π΅ΠΊΡ‚ΠΎΡ€Π°, Π±Π»ΠΈΠ·ΠΊΠΈΠ΅ ΠΊΠΎΠΎΡ€Π΄ΠΈΠ½Π°Ρ‚Ρ‹ Π²Π΅ΠΊΡ‚ΠΎΡ€ΠΎΠ² ΠΌΠΎΠ³ΡƒΡ‚ Π±Ρ‹Ρ‚ΡŒ ΠΈΠ½Ρ‚Π΅Ρ€ΠΏΡ€Π΅Ρ‚ΠΈΡ€ΠΎΠ²Π°Π½Ρ‹ ΠΊΠ°ΠΊ Π±Π»ΠΈΠ·ΠΊΠΈΠ΅ ΠΏΠΎ смыслу слова. Π’Π°ΠΊΠΈΠΌ ΠΎΠ±Ρ€Π°Π·ΠΎΠΌ, ΠΈΠ·Π²Π»Π΅Ρ‡Π΅Π½ΠΈΠ΅ сСмантичСских ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΠΉ (ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΠ΅ синонимии, Ρ€ΠΎΠ΄ΠΎ-Π²ΠΈΠ΄ΠΎΠ²Ρ‹Π΅ ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΡ ΠΈ Π΄Ρ€ΡƒΠ³ΠΈΠ΅) ΠΌΠΎΠΆΠ΅Ρ‚ Π±Ρ‹Ρ‚ΡŒ Π°Π²Ρ‚ΠΎΠΌΠ°Ρ‚ΠΈΠ·ΠΈΡ€ΠΎΠ²Π°Π½ΠΎ. УстановлСниС сСмантичСских ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΠΉ Π²Ρ€ΡƒΡ‡Π½ΡƒΡŽ считаСтся Ρ‚Ρ€ΡƒΠ΄ΠΎΠ΅ΠΌΠΊΠΎΠΉ ΠΈ Π½Π΅ΠΎΠ±ΡŠΠ΅ΠΊΡ‚ΠΈΠ²Π½ΠΎΠΉ Π·Π°Π΄Π°Ρ‡Π΅ΠΉ, Ρ‚Ρ€Π΅Π±ΡƒΡŽΡ‰Π΅ΠΉ большого количСства Π²Ρ€Π΅ΠΌΠ΅Π½ΠΈ ΠΈ привлСчСния экспСртов. Но срСди ассоциативных слов, сформированных с использованиСм ΠΌΠΎΠ΄Π΅Π»ΠΈ word2vec, Π²ΡΡ‚Ρ€Π΅Ρ‡Π°ΡŽΡ‚ΡΡ слова, Π½Π΅ ΠΏΡ€Π΅Π΄ΡΡ‚Π°Π²Π»ΡΡŽΡ‰ΠΈΠ΅ Π½ΠΈΠΊΠ°ΠΊΠΈΡ… ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΠΉ с Π³Π»Π°Π²Π½Ρ‹ΠΌ словом, для ΠΊΠΎΡ‚ΠΎΡ€ΠΎΠ³ΠΎ Π±Ρ‹Π» прСдставлСн ассоциативный ряд. Π’ Ρ€Π°Π±ΠΎΡ‚Π΅ Ρ€Π°ΡΡΠΌΠ°Ρ‚Ρ€ΠΈΠ²Π°ΡŽΡ‚ΡΡ Π΄ΠΎΠΏΠΎΠ»Π½ΠΈΡ‚Π΅Π»ΡŒΠ½Ρ‹Π΅ ΠΊΡ€ΠΈΡ‚Π΅Ρ€ΠΈΠΈ, ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Π΅ ΠΌΠΎΠ³ΡƒΡ‚ Π±Ρ‹Ρ‚ΡŒ ΠΏΡ€ΠΈΠΌΠ΅Π½ΠΈΠΌΡ‹ для Ρ€Π΅ΡˆΠ΅Π½ΠΈΡ Π΄Π°Π½Π½ΠΎΠΉ ΠΏΡ€ΠΎΠ±Π»Π΅ΠΌΡ‹. НаблюдСния ΠΈ ΠΏΡ€ΠΎΠ²Π΅Π΄Π΅Π½Π½Ρ‹Π΅ экспСримСнты с общСизвСстными характСристиками, Ρ‚Π°ΠΊΠΈΠΌΠΈ ΠΊΠ°ΠΊ частота слов, позиция Π² ассоциативном ряду, ΠΌΠΎΠ³ΡƒΡ‚ Π±Ρ‹Ρ‚ΡŒ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΠΎΠ²Π°Π½Ρ‹ для ΡƒΠ»ΡƒΡ‡ΡˆΠ΅Π½ΠΈΡ Ρ€Π΅Π·ΡƒΠ»ΡŒΡ‚Π°Ρ‚ΠΎΠ² ΠΏΡ€ΠΈ Ρ€Π°Π±ΠΎΡ‚Π΅ с Π²Π΅ΠΊΡ‚ΠΎΡ€Π½Ρ‹ΠΌ прСдставлСниСм слов Π² части опрСдСлСния сСмантичСских ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΠΉ для русского языка. Π’ экспСримСнтах ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΡƒΠ΅Ρ‚ΡΡ обучСнная Π½Π° корпусах Ѐлибусты модСль word2vec ΠΈ Ρ€Π°Π·ΠΌΠ΅Ρ‡Π΅Π½Π½Ρ‹Π΅ Π΄Π°Π½Π½Ρ‹Π΅ Викисловаря Π² качСствС ΠΎΠ±Ρ€Π°Π·Ρ†ΠΎΠ²Ρ‹Ρ… ΠΏΡ€ΠΈΠΌΠ΅Ρ€ΠΎΠ², Π² ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Ρ… ΠΎΡ‚Ρ€Π°ΠΆΠ΅Π½Ρ‹ сСмантичСскиС ΠΎΡ‚Π½ΠΎΡˆΠ΅Π½ΠΈΡ. БСмантичСски связанныС слова (ΠΈΠ»ΠΈ Ρ‚Π΅Ρ€ΠΌΠΈΠ½Ρ‹) нашли своС ΠΏΡ€ΠΈΠΌΠ΅Π½Π΅Π½ΠΈΠ΅ Π² тСзаурусах, онтологиях, ΠΈΠ½Ρ‚Π΅Π»Π»Π΅ΠΊΡ‚ΡƒΠ°Π»ΡŒΠ½Ρ‹Ρ… систСмах для ΠΎΠ±Ρ€Π°Π±ΠΎΡ‚ΠΊΠΈ СстСствСнного языка
    corecore