20 research outputs found

    Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

    Get PDF
    Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark

    A Survey of Cross-Lingual Sentiment Analysis Based on Pre-Trained Models

    Get PDF
    With the technology development of natural language processing, many researchers have studied Machine Learning (ML), Deep Learning (DL), monolingual Sentiment Analysis (SA) widely. However, there is not much work on Cross-Lingual SA (CLSA), although it is beneficial when dealing with low resource languages (e.g., Tamil, Malayalam, Hindi, and Arabic). This paper surveys the main challenges and issues of CLSA based on some pre-trained language models and mentions the leading methods to cope with CLSA. In particular, we compare and analyze their pros and cons. Moreover, we summarize the valuable cross-lingual resources and point out the main problems researchers need to solve in the future

    A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

    Get PDF
    There is an increasing demand for sentiment analysis of text from social media which are mostly code-mixed. Systems trained on monolingual data fail for code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific for this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still performs better. Only a few datasets for popular languages such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff's alpha above 0.8 for the dataset. We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts

    Bilingual lexicon induction across orthographically-distinct under-resourced Dravidian languages

    Get PDF
    Bilingual lexicons are a vital tool for under-resourced languages and recent state-of-the-art approaches to this leverage pretrained monolingual word embeddings using supervised or semi- supervised approaches. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. Especially in the case of low-resourced languages, seed dictionaries are not readily available, and as such, these methods produce extremely weak results on these languages. In this work, we focus on the Dravidian languages, namely Tamil, Telugu, Kannada, and Malayalam, which are even more challenging as they are written in unique scripts. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates, whereby we demonstrate that the longest common sub-sequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times, making bilingual lexicon induction approaches feasible for such under-resourced languages

    DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

    Get PDF
    This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has a high inter-annotator agreement in Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning methods. The dataset is available on Github (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858\#.YJtw0SYo\_0M).Comment: 36 page

    Establishing a Role for Minority Source Language in Multilingual Facilitation

    Get PDF
    This is a contribution to a festschrift.This document is dedicated to a young man, who, despite the number of times he has traveled around the Sun, is always open to new thoughts on ways to include languages, especially the smaller ones, and the people who speak them in far-reaching and sustainable open-source de- velopment. Since Trond Trosterud in Tromsø is attributed a terrific track record in transnational and circum-polar linguistics, we try to attract his attention further afield, to languages and phe- nomena he has only touched. The language phenomena addressed here come from Erzya and the Zyrian variety of Komi; Erzya has issues presented but not discussed in his dissertation, whereas Komi brings in issues of adnominal and predicate number marking in conjunction with case homonymy that have been resolved thanks to the flexibility of the infrastructure. These source languages, like others, have documented new dimensions and added shape the ever- growing infrastructure.Peer reviewe

    Extended Parallel Corpus for Amharic-English Machine Translation

    Full text link
    This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. It will be useful for machine translation of an under-resourced language, Amharic. The corpus is larger than previously compiled corpora; it is released for research purposes. We trained neural machine translation and phrase-based statistical machine translation models using the corpus. In the automatic evaluation, neural machine translation models outperform phrase-based statistical machine translation models.Comment: Accepted to 2nd AfricanNLP workshop at EACL 202
    corecore