119 research outputs found

    Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

    We address the problem of normalizing user-generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. We therefore propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.

    On the development of an information system for monitoring user opinion and its role for the public

    Social media services and analytics platforms are growing rapidly. Numerous events happen almost every day, and the role of social media monitoring tools is increasing accordingly. Social networks are widely used for managing and promoting brands and services. As a result, the most popular social analytics platforms serve business purposes, while the monitoring of social, economic, and political problems remains underrepresented and insufficiently researched. Moreover, most platforms focus on resource-rich languages such as English, whereas social media texts and comments in low-resource languages, such as Russian and Kazakh, are not well represented. This work is therefore devoted to developing and applying an information system, the OMSystem, for analyzing users' opinions on news portals, blogs, and social networks in Kazakhstan. The system uses sentiment dictionaries of the Russian and Kazakh languages and machine learning algorithms to determine the sentiment of social media texts. The overall structure and functionality of the system are also presented. The experimental part is devoted to building machine learning models for sentiment analysis on the Russian and Kazakh datasets. The performance of the models is then evaluated with accuracy, precision, recall, and F1-score metrics, and the models with the highest scores are selected for implementation in the OMSystem. The OMSystem's social analytics module is then used to analyze the healthcare, political, and social aspects of the most relevant topics connected with vaccination against the coronavirus disease. The analysis allowed us to discover the public social mood in the cities of Almaty and Nur-Sultan and in other large regional cities of Kazakhstan. The study covered two extended periods: 10-01-2021 to 30-05-2021 and 01-07-2021 to 12-08-2021.
    In the obtained results, people's moods and attitudes toward the Government's policies and actions were studied through such social network indicators as the level of topic discussion activity in society, the level of interest in the topic in society, and the mood level of society. These indicators, calculated by the OMSystem, made it possible to identify alarming factors in public opinion (negative attitudes toward government regulations and vaccination policies, trust in vaccination, etc.) and to assess the social mood.

    Investigating the Effect of Emoji in Opinion Classification of Uzbek Movie Review Comments

    Opinion mining on social media posts has become increasingly popular. Users often express their opinion on a topic not only with words but also with image symbols such as emoticons and emoji. In this paper, we investigate the effect of emoji-based features in opinion classification of Uzbek texts, and more specifically movie review comments from YouTube. Several classification algorithms are tested, and feature ranking is performed to evaluate the discriminative ability of the emoji-based features. Comment: 10 pages, 1 figure, 3 tables

    Negativizing emotive coloronyms: A Kazakhstan-US Ethno-Psycholinguistic comparison

    Neurotargeting prioritizes emotions in understanding collective unconscious and individual behavior. Comparative emotive linguistics reveals cross-cultural emotional expression variations. Despite extensive emotion research, gaps remain due to differing response norms. Psychology understands emotions well, but lacks universal classification, hindering linguistic description. Confusion between emotion and emotive obscures psychophysiological and verbal distinctions. Nonverbal emotives, reflecting emotions, require analysis of generation and expression mechanisms. This study examines color's role in conveying negative emotions in Kazakh writer A. Nurpeisov's "Blood and Sweat" and American writer T. Dreiser's "Trilogy of Desire." Authors use linguistic and nonverbal methods to portray emotions. Hypothesis: color as emotive state designation functions with "permissible-unacceptable" and "good-bad" evaluations, evident in shaping emotional reality perception. Analyzing coloristic negative emotives uncovers ethno-cultural metaphorical models, connecting emotive coloronyms with basic emotional concepts. Findings aid standardizing cognitive mechanisms for understanding mental experiences and comparative emotive linguistic terminology.

    Cross-lingual Transfer Can Worsen Bias in Sentiment Analysis

    Sentiment analysis (SA) systems are widely deployed in many of the world's languages, and there is well-documented evidence of demographic bias in these systems. In languages beyond English, scarcer training data is often supplemented with transfer learning using pre-trained models, including multilingual models trained on other languages. In some cases, even supervision data comes from other languages. Does cross-lingual transfer also import new biases? To answer this question, we use counterfactual evaluation to test whether gender or racial biases are imported when using cross-lingual transfer, compared to a monolingual transfer setting. Across five languages, we find that systems using cross-lingual transfer usually become more biased than their monolingual counterparts. We also find racial biases to be much more prevalent than gender biases. To spur further research on this topic, we release the sentiment models we used for this study, and the intermediate checkpoints throughout training, yielding 1,525 distinct models; we also release our evaluation code.

    Sentiment Classification of Russian Texts Using Automatically Generated Thesaurus

    This paper is devoted to an approach for sentiment classification of Russian texts that applies an automatically generated thesaurus of the subject area. The approach consists of a standard machine learning classifier and an embedded procedure that uses thesaurus relationships for better sentiment analysis. The thesaurus is generated fully automatically and does not require expert involvement in the classification process. Experiments with the approach on four Russian-language text corpora show the effectiveness of applying the thesaurus to sentiment classification.

    Analyzing tourist data on Twitter: a case study in the province of Granada at Spain

    This work has been funded by the Spanish Ministerio de Economía y Competitividad under project TIN2016-77902-C3-2-P, and the European Regional Development Fund (ERDF-FEDER)

    A Multilingual BPE Embedding Space for Universal Sentiment Lexicon Induction

    We present a new method for sentiment lexicon induction that is designed to be applicable to the entire range of typological diversity of the world's languages. We evaluate our method on Parallel Bible Corpus+ (PBC+), a parallel corpus of 1593 languages. The key idea is to use Byte Pair Encodings (BPEs) as basic units for multilingual embeddings. Through zero-shot transfer from English sentiment, we learn a seed lexicon for each language in the domain of PBC+. Through domain adaptation, we then generalize the domain-specific lexicon to a general one. We show, across typologically diverse languages in PBC+, good quality of seed and general-domain sentiment lexicons by intrinsic and extrinsic and by automatic and human evaluation. We make freely available our code, seed sentiment lexicons for all 1593 languages and induced general-domain sentiment lexicons for 200 languages.