Search CORE

5 research outputs found

Sentiment Classiﬁcation of Russian Texts Using Automatically Generated Thesaurus

Author: Ilya Paramonov
Ksenia Lagutina
Nadezhda Lagutina
Vladislav Larionov
Vladislav Petryakov
Publication venue: FRUCT
Publication date: 01/11/2018
Field of study

This paper is devoted to an approach for sentiment classiﬁcation of Russian texts applying an automatic thesaurus of the subject area. This approach consists of a standard machine learning classiﬁer and a procedure embedded into it, that uses the- saurus relationships for better sentiment analysis. The thesaurus is generated fully automatically and does not require expert’s involvement into classiﬁcation process. Experiments conducted with the approach and four Russian-language text corpora, show effectiveness of thesaurus application to sentiment classiﬁcation

Directory of Open Access Journals

Методические аспекты выделения семантических отношений для автоматической генерации специализированных тезаурусов и их оценки

Author: E. Mamedov I.
I. Paramonov V.
K. Lagutina V.
N. Lagutina S.
И. Парамонов В.
К. Лагутина В.
Н. Лагутина С.
Э. Мамедов И.
Publication venue: 'P.G. Demidov Yaroslavl State University'
Publication date: 20/12/2016
Field of study

The paper is devoted to analysis of methods for automatic generation of a specialized thesaurus. The main algorithm of generation consists of three stages: selection and preprocessing of a text corpus, recognition of thesaurus terms, and extraction of relations among terms. Our work is focused on exploring methods for semantic relation extraction. We developed a test bench that allow to test well-known algorithms for extraction of synonyms and hypernyms. These algorithms are based on diﬀerent relation extraction techniques: lexico-syntactic patterns, morpho-syntactic rules, measurement of term information quantity, general-purpose thesaurus WordNet, and Levenstein distance. For analysis of the result thesaurus we proposed a complex assessment that includes the following metrics: precision of extracted terms, precision and recall of hierarchical and synonym relations, and characteristics of the thesaurus graph (the number of extracted terms and semantic relationships of diﬀerent types, the number of connected components, and the number of vertices in the largest component). The proposed set of metrics allows to evaluate the quality of the thesaurus as a whole, reveal some drawbacks of standard relation extraction methods, and create more eﬃcient hybrid methods that can generate thesauri with better characteristics than thesauri generated by using separate methods. In order to illustrate this fact, one of such hybrid methods is considered in the paper. It combines the best standard algorithms for hypernym and synonym extraction and generates a specialized medical thesaurus. The hybrid method leaves the thesaurus quality on the same level and ﬁnds more relations between terms than well-known algorithms.Работа посвящена анализу методов автоматической генерации специализированного тезауруса. Основной алгоритм генерации состоит из трех шагов: отбор и предварительная обработка корпуса текстов, формирование множества терминов для включения в тезаурус и выделение связей между терминами тезауруса. Данное исследование сфокусировано на изучении методов выделения семантических связей, для чего авторами был разработан программный стенд, который позволяет протестировать распространенные алгоритмы выделения гиперонимов и синонимов, использующие в своей работе лексико-синтаксические шаблоны, морфо-синтаксические правила, количество информации терминов, тезаурус общего назначения WordNet и расстояние Левенштейна. Для анализа результирующего тезауруса, созданного на стенде, авторами была разработана комплексная оценка, содержащая следующие характеристики качества: точность выделения терминов, точность и полнота выделения синонимических и гиперонимических связей, а также метрики графа тезауруса (количество выделенных терминов, количество семантических связей различных типов, число компонент связности и число вершин в наибольшей компоненте). Предлагаемый набор метрик позволяет оценить качество тезауруса в целом, выявить отдельные недостатки стандартных методов выделения связей и построить более эффективные гибридные методы, генерирующие тезаурус с лучшими характеристиками по сравнению с тезаурусами, генерируемыми при использовании отдельных методов. Для иллюстрации данного факта в статье рассмотрен один из таких гибридных методов. Он комбинирует лучшие стандартные алгоритмы построения гиперонимических и синонимических связей и строит специализированный тезаурус в области медицины с тем же уровнем качества, что и другие методы, но с большим количеством связей между терминами

Modeling and Analysis of Information Systems / Моделирование и анализ информационных систем (МАИС)

A survey on thesauri application in automatic natural language processing

Author: Andrey Vasilyev
Ilya Paramonov
Ivan Shchitov
Ksenia Lagutina
Nadezhda Lagutina
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/11/2017
Field of study

This paper is devoted to investigate efficiency of thesauri use in popular natural language processing (NLP) fields: information retrieval and analysis of texts and subject areas. A thesaurus is a natural language resource that models a subject area and can reflect human expert's knowledge in many NLP tasks. The main target of this survey is to determine how much thesauri affect processing quality and where they can provide better performance. We describe studies that use different types of thesauri, discuss contribution of the thesaurus into achieved results, and propose directions for future research in the thesaurus field

Directory of Open Access Journals

Русскоязычные тезаурусы: автоматизированное построение и применение в задачах обработки текстов на естественном языке

Author: Aleksey Adrianov S.
Ilya Paramonov V.
Ksenia Lagutina V.
Nadezhda Lagutina S.
Алексей Адрианов Сергеевич
Илья Парамонов Вячеславович
Ксения Лагутина Владимировна
Надежда Лагутина Станиславовна
Publication venue: 'P.G. Demidov Yaroslavl State University'
Publication date: 27/08/2018
Field of study

The paper reviews the existing Russian-language thesauri in digital form and methods of their automatic construction and application. The authors analyzed the main characteristics of open access thesauri for scientific research, evaluated trends of their development, and their effectiveness in solving natural language processing tasks. The statistical and linguistic methods of thesaurus construction that allow to automate the development and reduce labor costs of expert linguists were studied. In particular, the authors considered algorithms for extracting keywords and semantic thesaurus relationships of all types, as well as the quality of thesauri generated with the use of these tools. To illustrate features of various methods for constructing thesaurus relationships, the authors developed a combined method that generates a specialized thesaurus fully automatically taking into account a text corpus in a particular domain and several existing linguistic resources. With the proposed method, experiments were conducted with two Russian-language text corpora from two subject areas: articles about migrants and tweets. The resulting thesauri were assessed by using an integrated assessment developed in the previous authors’ study that allows to analyze various aspects of the thesaurus and the quality of the generation methods. The analysis revealed the main advantages and disadvantages of various approaches to the construction of thesauri and the extraction of semantic relationships of different types, as well as made it possible to determine directions for future study.В работе выполнен обзор существующих электронных русскоязычных тезаурусов и методов их автоматического построения и применения. Авторы провели анализ основных характеристик тезаурусов, находящихся в открытом доступе, для научных исследований, оценили динамику их развития и эффективность в решении задач по обработке естественного языка. Были исследованы статистические и лингвистические методы построения тезаурусов, которые позволяют автоматизировать разработку и уменьшить затраты на труд экспертов-лингвистов. В частности, рассматривались алгоритмы выделения ключевых терминов из текстов и семантических тезаурусных связей всех типов, а также качество применения получившихся в результате их работы тезаурусов. Для наглядной иллюстрации особенностей различных методов построения тезаурусных связей был разработан комбинированный метод, генерирующий специализированный тезаурус полностью автоматически на основе корпуса текстов предметной области и нескольких существующих лингвистических ресурсов. С использованием предложенного метода были проведены эксперименты с русскоязычными корпусами текстов из двух предметных областей: статьи о мигрантах и твиты. Для анализа полученных тезаурусов использовалась комплексная оценка, разработанная авторами в предыдущем исследовании, которая позволяет определить различные аспекты тезауруса и качество методов его генерации. Проведённый анализ выявил основные достоинства и недостатки различных подходов к построению тезаурусов и выделению семантических связей различных типов, а также позволил определить потенциальные направления будущих исследований.

Modeling and Analysis of Information Systems / Моделирование и анализ информационных систем (МАИС)

Keywords at Work: Investigating Keyword Extraction in Social Media Applications

Author: Lahiri Shibamouli
Publication venue
Publication date: 01/01/2018
Field of study

This dissertation examines a long-standing problem in Natural Language Processing (NLP) -- keyword extraction -- from a new angle. We investigate how keyword extraction can be formulated on social media data, such as emails, product reviews, student discussions, and student statements of purpose. We design novel graph-based features for supervised and unsupervised keyword extraction from emails, and use the resulting system with success to uncover patterns in a new dataset -- student statements of purpose. Furthermore, the system is used with new features on the problem of usage expression extraction from product reviews, where we obtain interesting insights. The system while used on student discussions, uncover new and exciting patterns. While each of the above problems is conceptually distinct, they share two key common elements -- keywords and social data. Social data can be messy, hard-to-interpret, and not easily amenable to existing NLP resources. We show that our system is robust enough in the face of such challenges to discover useful and important patterns. We also show that the problem definition of keyword extraction itself can be expanded to accommodate new and challenging research questions and datasets.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/145929/1/lahiri_1.pd

Deep Blue Documents at the University of Michigan