15 research outputs found

    Watset: automatic induction of synsets from a graph of synonyms

    This paper presents a new graph-based approach that induces synsets using synonymy dictionaries and word embeddings. First, we build a weighted graph of synonyms extracted from commonly available resources, such as Wiktionary. Second, we apply word sense induction to deal with ambiguous words. Finally, we cluster the disambiguated version of the ambiguous input graph into synsets. Our meta-clustering approach lets us use an efficient hard clustering algorithm to perform a fuzzy clustering of the graph. Despite its simplicity, our approach shows excellent results, outperforming five competitive state-of-the-art methods in terms of F-score on three gold standard datasets for English and Russian derived from large-scale manually constructed lexical resources.

    Keyphrase generation for Russian-language scientific texts using the mT5 model

    In this work, we applied the multilingual text-to-text transformer (mT5) to keyphrase generation for Russian scientific texts, using the Keyphrases CS&Math Russian corpus. Automatic keyphrase selection is a relevant natural language processing task, since keyphrases help readers find articles and facilitate the systematization of scientific texts. In this paper, keyphrase selection is treated as a text summarization task. The mT5 model was fine-tuned on the abstracts of Russian research papers: abstracts served as the model input and comma-separated lists of keyphrases as the output. The results of mT5 were compared with several baselines, including TopicRank, YAKE!, RuTermExtract, and KeyBERT, and are reported in terms of the full-match F1-score, ROUGE-1, and BERTScore. The best results on the test set were obtained by mT5 and RuTermExtract. mT5 achieves the highest F1-score (11.24%), exceeding RuTermExtract by 0.22%. RuTermExtract shows the highest ROUGE-1 score (15.12%). According to BERTScore, the best results were also obtained by these two methods: mT5 with 76.89% (BERTScore using mBERT) and RuTermExtract with 75.8% (BERTScore using ruSciBERT). Moreover, we evaluated the capability of mT5 to predict keyphrases that are absent from the source text. The important limitations of the proposed approach are the need for a training sample for fine-tuning and the probably limited suitability of the fine-tuned model in cross-domain settings. The advantages of keyphrase generation with pre-trained mT5 are that there is no need to define the number and length of keyphrases or to normalize the produced keyphrases, which is important for inflected languages, and that the model can generate keyphrases that are not present in the text explicitly.

    The Palgrave Handbook of Digital Russia Studies

    This open access handbook presents a multidisciplinary and multifaceted perspective on how the ‘digital’ is simultaneously changing Russia and the research methods scholars use to study Russia. It provides a critical update on how Russian society, politics, economy, and culture are reconfigured in the context of ubiquitous connectivity and accounts for the political and societal responses to digitalization. In addition, it answers practical and methodological questions in handling Russian data and a wide array of digital methods. The volume makes a timely intervention in our understanding of the changing field of Russian Studies and is an essential guide for scholars and for advanced undergraduate and graduate students studying Russia today.

    Annual record no. 49

    INHIGEO produces an annual publication that includes information on the commission's activities, national reports, book reviews, interviews, and occasional historical articles.

    Preface

    Geographic information extraction from texts

    A large volume of unstructured texts containing valuable geographic information is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, and also to identify research gaps in geographic information extraction.