Sentiment Classification of Russian Texts Using Automatically Generated Thesaurus
This paper presents an approach to sentiment classification of Russian texts that applies an automatically generated thesaurus of the subject area. The approach combines a standard machine learning classifier with an embedded procedure that uses thesaurus relationships to improve sentiment analysis. The thesaurus is generated fully automatically and does not require an expert's involvement in the classification process. Experiments with the approach on four Russian-language text corpora show the effectiveness of applying the thesaurus to sentiment classification
Sentiment Classification into Three Classes Applying Multinomial Bayes Algorithm, N-grams, and Thesaurus
This paper develops a method that classifies English and Russian texts by sentiment into positive, negative, and neutral classes. The proposed method is based on the Multinomial Naive Bayes classifier with additional use of n-grams. The classifier is trained either on three classes or on two contrasting classes with a threshold to separate neutral texts. Experiments with texts on various topics showed a significant improvement in classification quality for reviews from a particular domain. In addition, the use of thesaurus relationships for three-class sentiment classification was analyzed; however, it did not significantly improve the classification results
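The two-class-plus-threshold variant described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `BinaryMultinomialNB`, the helper `ngrams`, the toy training texts, and the threshold value of 0.5 on the log-score margin are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return word n-grams (unigrams up to n_max-grams) of a lowercased text."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


class BinaryMultinomialNB:
    """Multinomial Naive Bayes over n-gram counts with Laplace smoothing.

    Trained on 'positive' and 'negative' texts only; 'neutral' is assigned
    when the two class scores are closer than a threshold (an assumption).
    """

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.counts = {c: Counter() for c in self.classes}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def log_scores(self, text):
        total_docs = sum(self.docs.values())
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            score = math.log(self.docs[c] / total_docs)  # class prior
            for feat in ngrams(text):
                if feat in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log(
                        (self.counts[c][feat] + 1) / (total + len(self.vocab))
                    )
            scores[c] = score
        return scores

    def predict(self, text, threshold=0.5):
        scores = self.log_scores(text)
        margin = scores["positive"] - scores["negative"]
        if abs(margin) < threshold:
            return "neutral"
        return "positive" if margin > 0 else "negative"


# Toy data, invented for this example.
model = BinaryMultinomialNB().fit(
    ["great phone love it", "excellent battery great screen",
     "terrible phone hate it", "awful battery bad screen"],
    ["positive", "positive", "negative", "negative"],
)
print(model.predict("great screen"))
print(model.predict("terrible screen"))
print(model.predict("phone battery"))
```

In practice the threshold would be tuned on held-out data; a text whose n-grams are equally likely under both classes falls inside the margin and is labelled neutral.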
Sentiment classification of long newspaper articles based on automatically generated thesaurus with various semantic relationships
The paper describes a new approach to sentiment classification of long newspaper texts using an automatically generated thesaurus. An important part of the proposed approach is the creation of a specialized thesaurus and the computation of terms' sentiment polarities based on relationships between terms. The approach's effectiveness was demonstrated on a corpus of articles about American immigrants. The experiments showed that the automatically created thesaurus provides better classification quality than manual ones, and that for this task the approach generally outperforms existing ones
A survey on thesauri application in automatic natural language processing
This paper investigates the efficiency of thesaurus use in popular natural language processing (NLP) fields: information retrieval and the analysis of texts and subject areas. A thesaurus is a natural language resource that models a subject area and can reflect a human expert's knowledge in many NLP tasks. The main goal of this survey is to determine how much thesauri affect processing quality and where they can provide better performance. We describe studies that use different types of thesauri, discuss the contribution of the thesaurus to the achieved results, and propose directions for future research in the thesaurus field
Navigating multilingual news collections using automatically extracted information
We are presenting a text analysis tool set that allows analysts in various
fields to sieve through large collections of multilingual news items quickly
and to find information that is of relevance to them. For a given document
collection, the tool set automatically clusters the texts into groups of
similar articles, extracts names of places, people and organisations, lists the
user-defined specialist terms found, links clusters and entities, and generates
hyperlinks. Through its daily news analysis operating on thousands of articles
per day, the tool also learns relationships between people and other entities.
The fully functional prototype system allows users to explore and navigate
multilingual document collections across languages and time. Comment: This paper describes the main functionality of the JRC's
fully-automatic news analysis system NewsExplorer, which is freely accessible
in currently thirteen languages at http://press.jrc.it/NewsExplorer/ . 8
page
An Urdu semantic tagger - lexicons, corpora, methods and tools
Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These tools identify or tag only partial core semantic information of language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words), based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still needed for the Urdu language in order to provide comprehensive semantic analysis of Urdu text. This research work reports on the development of an Urdu semantic tagging tool and discusses the challenging issues faced in this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of newly developed tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech tagger. Results for these tools are as follows: the word tokenizer reports an F-score of 94.01% and accuracy of 97.21%, the sentence tokenizer shows an F-score of 92.59% and accuracy of 93.15%, and the POS tagger shows an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques.
Furthermore, all semantic lexicons have been developed using a novel combination of automatic or semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entity. A large multi-target annotated corpus is also constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; the proposed corpus is also used to train and test supervised multi-target Machine Learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming Loss of 0.06% and Accuracy of 0.94%. The best lexical coverages of 88.59%, 99.63%, 96.71% and 89.63% are obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus
Analysis of the use of different types of relations between terms of a thesaurus generated with hybrid methods in text classification tasks
The main purpose of the article is to analyze how effectively different types of thesaurus relations can be used to solve text classification tasks. The basis of the study is an automatically generated thesaurus of a subject area that contains three types of relations: synonymous, hierarchical and associative. To generate the thesaurus, the authors use a hybrid method based on several linguistic and statistical algorithms for extracting semantic relations. The method makes it possible to create a thesaurus with a sufficiently large number of terms and relations among them. The authors consider two problems: topical text classification and sentiment classification of large newspaper articles. To solve them, the authors developed two approaches that complement standard algorithms with a procedure that takes thesaurus relations into account to determine the semantic features of texts. The approach to topical classification includes the standard unsupervised BM25 algorithm and a procedure that takes into account the synonymous and hierarchical relations of the subject-area thesaurus. The approach to sentiment classification consists of two steps. In the first step, a thesaurus is created whose terms' polarity weights are calculated from the term occurrences in the training set or from the weights of related thesaurus terms. In the second step, the thesaurus is used to compute the features of words from texts and to classify the texts with the SVM or Naive Bayes algorithm. In experiments with the text corpora BBCSport, Reuters, PubMed and the corpus of articles about American immigrants, the authors varied the types of thesaurus relations involved in the classification and the degree of their use. The results of the experiments make it possible to evaluate the efficiency of applying thesaurus relations to the classification of natural-language texts and to determine under what conditions particular relations have a greater or lesser effect.
In particular, the most useful thesaurus relations are synonymous and hierarchical, as they provide better classification quality.
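The first step of the sentiment approach described in this abstract (term polarity weights taken from training-set occurrences, or from related thesaurus terms when a term was not observed) might be sketched as follows. The weighting formula, the function name `term_polarities`, and the toy thesaurus are assumptions for illustration only; the abstract does not give the authors' actual formula.

```python
from collections import Counter


def term_polarities(train_docs, thesaurus):
    """Assign a polarity weight in [-1, 1] to thesaurus terms.

    train_docs: list of (tokens, label) pairs, label 'positive' or 'negative'.
    thesaurus:  dict mapping each term to the set of terms related to it.
    """
    pos, neg = Counter(), Counter()
    for tokens, label in train_docs:
        target = pos if label == "positive" else neg
        target.update(set(tokens))  # count each term once per document

    # Terms seen in training: weight from their own occurrences
    # (assumed formula: (pos - neg) / (pos + neg)).
    direct = {}
    for term in thesaurus:
        seen = pos[term] + neg[term]
        if seen:
            direct[term] = (pos[term] - neg[term]) / seen

    # Unseen terms: mean weight of related terms that were seen in training.
    weights = dict(direct)
    for term, related in thesaurus.items():
        if term not in direct:
            known = [direct[r] for r in related if r in direct]
            if known:
                weights[term] = sum(known) / len(known)
    return weights


# Toy thesaurus and training set, invented for this example.
thesaurus = {
    "good": {"fine"},
    "fine": {"good", "decent"},
    "decent": {"fine"},
    "bad": {"poor"},
    "poor": {"bad"},
}
train_docs = [
    (["good", "film"], "positive"),
    (["good", "plot"], "positive"),
    (["bad", "film"], "negative"),
]
weights = term_polarities(train_docs, thesaurus)
print(weights)
```

Here "fine" and "poor" never occur in the training set but inherit a weight from a related term, while "decent" stays unweighted because its only neighbour was not itself observed. The resulting weights could then serve as features for an SVM or Naive Bayes classifier in the second step.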
The Palgrave Handbook of Digital Russia Studies
This open access handbook presents a multidisciplinary and multifaceted perspective on how the "digital" is simultaneously changing Russia and the research methods scholars use to study Russia. It provides a critical update on how Russian society, politics, economy, and culture are reconfigured in the context of ubiquitous connectivity and accounts for the political and societal responses to digitalization. In addition, it answers practical and methodological questions in handling Russian data and a wide array of digital methods. The volume makes a timely intervention in our understanding of the changing field of Russian Studies and is an essential guide for scholars, advanced undergraduate and graduate students studying Russia today