
805 research outputs found

Annotations of Connectives and Arguments in Malayalam Language

Author: Devi Sobha Lalitha
Sheeja S. Kumari
Publication venue: The Author(s). Published by Elsevier Ltd.
Publication date: 31/12/2016

Discourse relations in natural languages link clauses in text and compose the overall text structure. Discourse connectives are an important part of modeling Malayalam discourse structure. We followed the annotation procedure of the Penn Discourse Treebank, tagged discourse connectives and their arguments in Malayalam text, and report the senses of the relations. This annotation work helps clarify how discourse connectives behave and where they appear with respect to semantic rules in Malayalam discourse. Discourse connectives may or may not be explicitly present in a relation. We therefore annotate both explicit and implicit connectives and their arguments in Malayalam text, and the results are encouraging.

Elsevier - Publisher Connector

A Comprehensive Review of Sentiment Analysis on Indian Regional Languages: Techniques, Challenges, and Trends

Author: Kale Sunil D.
Mahalle Parikshit N.
Mane Deepak T.
Potdar Girish P.
Prasad Rajesh
Upadhye Gopal D.
Publication venue: Auricle Global Society of Education and Research
Publication date: 31/08/2023

Sentiment analysis (SA) is the process of understanding emotion within a text. It helps identify the opinion, attitude, and tone of a text, categorizing it as positive, negative, or neutral. SA is frequently used today as more and more people get a chance to share their thoughts due to the advent of social media. Sentiment analysis benefits industries around the globe, such as finance, advertising, marketing, travel, and hospitality. Although the majority of work in this field is on global languages like English, in recent years the importance of SA in local languages has also been widely recognized. This has led to considerable research on the analysis of Indian regional languages. This paper comprehensively reviews SA in the following major Indian regional languages: Marathi, Hindi, Tamil, Telugu, Malayalam, Bengali, Gujarati, and Urdu. Furthermore, this paper presents techniques, challenges, findings, recent research trends, and future scope for improving the accuracy of results.

International Journal on Recent and Innovation Trends in Computing and Communication

CRPC-DB – A Discourse Bank for Portuguese

Author: Lejeune Pierre
Mendes Amália
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2022


Universidade de Lisboa: Repositório.UL

Universal Dependencies Parsing for Colloquial Singaporean English

Author: Chan GuangYong Leonard
Chieu Hai Leong
Wang Hongmin
Yang Jie
Zhang Yue
Publication date: 01/01/2017

Singlish can be interesting to the ACL community both linguistically, as a major creole based on English, and computationally, for information extraction and sentiment analysis of regional social media. We investigate dependency parsing of Singlish by constructing a dependency treebank under the Universal Dependencies scheme, and then training a neural network model by integrating English syntactic knowledge into a state-of-the-art parser trained on the Singlish treebank. Results show that English knowledge can lead to a 25% relative error reduction, resulting in a parser with 84.47% accuracy. To the best of our knowledge, we are the first to use neural stacking to improve cross-lingual dependency parsing on low-resource languages. We make both our annotation and parser available for further research. Comment: Accepted by ACL 201

arXiv.org e-Print Archive

Crossref

2nd Conference on Language, Data and Knowledge (LDK 2019), May 20–23, 2019, Leipzig, Germany

Author: Buitelaar Paul
Chiarcos Christian
de Melo Gerard
Dojchinovski Milan
Eskevich Maria
Fäth Christian
Klimek Bettina
McCrae John P.
Publication date: 27/04/2023

OPUS Augsburg

XLM-EMO: multilingual emotion prediction in social media text

Author: Bianchi Federico
Hovy Dirk
Nozza Debora
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2022

Archivio istituzionale della Ricerca - Bocconi

Detection of Offensive and Threatening Online Content in a Low Resource Language

Author: Adam Fatima Muhammad
Inuwa-Dutse Isa
Zandam Abubakar Yakubu
Publication date: 17/11/2023

Hausa is a major Chadic language, spoken by over 100 million people in Africa. However, from a computational linguistic perspective, it is considered a low-resource language, with limited resources to support Natural Language Processing (NLP) tasks. Online platforms often facilitate social interactions that can lead to the use of offensive and threatening language, which can go undetected due to the lack of detection systems designed for Hausa. This study aimed to address this issue by (1) conducting two user studies (n=308) to investigate cyberbullying-related issues, (2) collecting and annotating the first set of offensive and threatening datasets to support relevant downstream tasks in Hausa, (3) developing a detection system to flag offensive and threatening content, and (4) evaluating the detection system and the efficacy of the Google-based translation engine in detecting offensive and threatening terms in Hausa. We found that offensive and threatening content is quite common, particularly when discussing religion and politics. Our detection system was able to detect more than 70% of offensive and threatening content, although many of these were mistranslated by Google's translation engine. We attribute this to the subtle relationship between offensive and threatening content and idiomatic expressions in the Hausa language. We recommend that diverse stakeholders participate in understanding local conventions and demographics in order to develop a more effective detection system. These insights are essential for implementing targeted moderation strategies to create a safe and inclusive online environment. Comment: 25 pages, 5 figures, 8 tables

arXiv.org e-Print Archive

DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

Author: Chakravarthi Bharathi Raja
Jose Navya
McCrae John P.
Muralidaran Vigneshwaran
Priyadharshini Ruba
Sherly Elizabeth
Suryawanshi Shardul
Publication date: 17/06/2021

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning methods. The dataset is available on Github (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858#.YJtw0SYo_0M). Comment: 36 pages

arXiv.org e-Print Archive

Online Research @ Cardiff

PubMed Central

IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

Author: AI4Bharat
AK Raghavan
Chitale Pranjal A.
Dabre Raj
Doddapaneni Sumanth
Gala Jay
Gumma Varun
Khapra Mitesh M.
Kumar Aswanth
Kumar Pratyush
Kunchukuttan Anoop
Nawale Janki
Puduppully Ratish
Raghavan Vivek
Sujatha Anupama
Publication date: 17/06/2023

India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. The 22 languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all the 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2

arXiv.org e-Print Archive