345 research outputs found

    Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

    We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh or Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.

    Deep Learning Model for Bilingual Sentiment Classification of Short Texts

    Sentiment analysis of short texts such as Twitter messages and comments in news portals is challenging due to the lack of contextual information. We propose a deep neural network model that uses bilingual word embeddings to effectively solve the sentiment classification problem for a given pair of languages. We apply our approach to two corpora of two different language pairs: English-Russian and Russian-Kazakh. We show how to train a classifier in one language and predict in another. Our approach achieves 73% accuracy for English and 74% accuracy for Russian. For Kazakh sentiment analysis, we propose a baseline method that achieves 60% accuracy, and a method to learn bilingual embeddings from a large unlabeled corpus using bilingual word pairs.

    Investigating the Effect of Emoji in Opinion Classification of Uzbek Movie Review Comments

    Opinion mining on social media posts has become increasingly popular. Users often express their opinion on a topic not only with words but also with image symbols such as emoticons and emoji. In this paper, we investigate the effect of emoji-based features in opinion classification of Uzbek texts, and more specifically movie review comments from YouTube. Several classification algorithms are tested, and feature ranking is performed to evaluate the discriminative ability of the emoji-based features. Comment: 10 pages, 1 figure, 3 tables

    Sentiment Classification of Russian Texts Using Automatically Generated Thesaurus

    This paper is devoted to an approach for sentiment classification of Russian texts that applies an automatically generated thesaurus of the subject area. The approach consists of a standard machine learning classifier and a procedure embedded into it that uses thesaurus relationships for better sentiment analysis. The thesaurus is generated fully automatically and does not require expert involvement in the classification process. Experiments conducted with the approach on four Russian-language text corpora show the effectiveness of applying the thesaurus to sentiment classification.

    Ontology engineering of automatic text processing methods

    Currently, ontologies are recognized as the most effective means of formalizing and systematizing knowledge and data in a scientific subject area (SSA). Practice has shown that using ontology design patterns is effective in developing ontologies of scientific subject areas. This is because an ontology of a scientific subject area, as a rule, contains a large number of typical fragments that are well described by ontology design patterns. In the paper, we present an approach to ontology engineering of automatic text processing methods based on ontology design patterns. To obtain an ontology that describes automatic text processing sufficiently fully, it is necessary to process a large number of scientific publications and information resources containing information from the modeling area. The process of updating the ontology with information from such sources can be facilitated and sped up by using lexical and syntactic ontology design patterns. Our ontology of automatic text processing will become the conceptual basis of an intelligent information resource on modern methods of automatic text processing, which will provide systematization of all information on these methods, its integration into a single information space, convenient navigation through it, as well as meaningful access to it.

    Negativizing emotive coloronyms: A Kazakhstan-US Ethno-Psycholinguistic comparison

    Neurotargeting prioritizes emotions in understanding the collective unconscious and individual behavior. Comparative emotive linguistics reveals cross-cultural emotional expression variations. Despite extensive emotion research, gaps remain due to differing response norms. Psychology understands emotions well, but lacks a universal classification, hindering linguistic description. Confusion between emotion and emotive obscures psychophysiological and verbal distinctions. Nonverbal emotives, reflecting emotions, require analysis of generation and expression mechanisms. This study examines color's role in conveying negative emotions in Kazakh writer A. Nurpeisov's "Blood and Sweat" and American writer T. Dreiser's "Trilogy of Desire." Authors use linguistic and nonverbal methods to portray emotions. Hypothesis: color as emotive state designation functions with "permissible-unacceptable" and "good-bad" evaluations, evident in shaping emotional reality perception. Analyzing coloristic negative emotives uncovers ethno-cultural metaphorical models, connecting emotive coloronyms with basic emotional concepts. Findings aid standardizing cognitive mechanisms for understanding mental experiences and comparative emotive linguistic terminology.

    MEGA: Multilingual Evaluation of Generative AI

    Generative AI models have shown impressive performance on many Natural Language Processing tasks such as language understanding, reasoning, and language generation. An important question being asked by the AI community today is about the capabilities and limits of these models, and it is clear that evaluating generative AI is very challenging. Most studies on generative LLMs have been restricted to English and it is unclear how capable these models are at understanding and generating text in other languages. We present the first comprehensive benchmarking of generative LLMs - MEGA, which evaluates models on standard NLP benchmarks, covering 16 NLP datasets across 70 typologically diverse languages. We compare the performance of generative LLMs including Chat-GPT and GPT-4 to State of the Art (SOTA) non-autoregressive models on these tasks to determine how well generative models perform compared to the previous generation of LLMs. We present a thorough analysis of the performance of models across languages and tasks and discuss challenges in improving the performance of generative LLMs on low-resource languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field. Comment: EMNLP 202

    A Multilingual BPE Embedding Space for Universal Sentiment Lexicon Induction

    We present a new method for sentiment lexicon induction that is designed to be applicable to the entire range of typological diversity of the world’s languages. We evaluate our method on Parallel Bible Corpus+ (PBC+), a parallel corpus of 1593 languages. The key idea is to use Byte Pair Encodings (BPEs) as basic units for multilingual embeddings. Through zero-shot transfer from English sentiment, we learn a seed lexicon for each language in the domain of PBC+. Through domain adaptation, we then generalize the domain-specific lexicon to a general one. We show – across typologically diverse languages in PBC+ – good quality of seed and general-domain sentiment lexicons by intrinsic and extrinsic and by automatic and human evaluation. We make freely available our code, seed sentiment lexicons for all 1593 languages and induced general-domain sentiment lexicons for 200 languages.