A Comprehensive Review of Sentiment Analysis on Indian Regional Languages: Techniques, Challenges, and Trends
Sentiment analysis (SA) is the process of understanding emotion within a text. It helps identify the opinion, attitude, and tone of a text, categorizing it as positive, negative, or neutral. SA is used increasingly today as more and more people share their thoughts thanks to the advent of social media. Sentiment analysis benefits industries around the globe, such as finance, advertising, marketing, travel, and hospitality. Although the majority of work in this field has been done on global languages like English, in recent years the importance of SA in local languages has also been widely recognized, leading to considerable research on Indian regional languages. This paper comprehensively reviews SA in the following major Indian regional languages: Marathi, Hindi, Tamil, Telugu, Malayalam, Bengali, Gujarati, and Urdu. Furthermore, this paper presents techniques, challenges, findings, recent research trends, and future scope for improving the accuracy of results.
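To make the polarity categorization mentioned above concrete, here is a minimal lexicon-based sketch. The tiny word lists are invented toy examples, not a real sentiment lexicon for any Indian regional language; practical systems use large curated lexicons or trained classifiers.

```python
# Minimal lexicon-based sentiment sketch: count polar words and label the
# text positive, negative, or neutral. Toy lexicon for illustration only.

POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "terrible", "poor", "sad", "hate"}

def sentiment(text: str) -> str:
    # Lowercase, split on whitespace, and strip trailing punctuation.
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("the service was great and the food excellent"))  # positive
print(sentiment("a terrible, sad experience"))                    # negative
```

The same three-way scheme underlies most of the surveyed work, whether the scoring comes from a lexicon as here or from a learned model.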
Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems (text summarization, information extraction, information retrieval, etc.), including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.
Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021
With the growth of social media platforms' influence, the impact of their misuse becomes ever greater. The importance of automatic detection of threatening and abusive language can hardly be overestimated. However, most existing studies and state-of-the-art methods focus on English as the target language, with limited work on low- and medium-resource languages. In this paper, we present two shared tasks on abusive and threatening language detection for Urdu, a language with more than 170 million speakers worldwide. Both are posed as binary classification tasks in which participating systems must classify Urdu tweets into two classes: (i) Abusive and Non-Abusive for the first task, and (ii) Threatening and Non-Threatening for the second. We present two manually annotated datasets of tweets labelled accordingly. The abusive dataset contains 2,400 annotated tweets in the training part and 1,100 in the test part; the threatening dataset contains 6,000 annotated tweets in the training part and 3,950 in the test part. We also provide logistic regression and BERT-based baseline classifiers for both tasks. In this shared task, 21 teams from six countries (India, Pakistan, China, Malaysia, the United Arab Emirates, and Taiwan) registered to participate; 10 teams submitted runs for Subtask A (Abusive Language Detection), 9 teams submitted runs for Subtask B (Threatening Language Detection), and seven teams submitted technical reports. The best-performing system achieved an F1 score of 0.880 for Subtask A and 0.545 for Subtask B. For both subtasks, an mBERT-based transformer model showed the best performance.
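The abstract mentions a logistic regression baseline for these binary tasks. A self-contained sketch of such a count-based bag-of-words baseline follows, trained with plain gradient descent; the four-tweet English corpus and its labels are invented stand-ins, not the actual annotated Urdu data, and the real baselines would use proper feature extraction and a library implementation.

```python
# Bag-of-words + logistic regression sketch for binary abusive-language
# detection. Toy corpus; 1 = abusive, 0 = non-abusive.
import math
from collections import Counter

train = [
    ("you are a total idiot", 1),
    ("what a stupid worthless post", 1),
    ("thanks for sharing this article", 0),
    ("have a wonderful day friend", 0),
]

# Vocabulary and count-based feature vectors.
vocab = sorted({w for text, _ in train for w in text.split()})

def featurize(text):
    counts = Counter(text.split())
    return [counts.get(w, 0) for w in vocab]

# Stochastic gradient descent on the logistic loss.
w = [0.0] * len(vocab)
b = 0.0
lr = 0.5
for _ in range(200):
    for text, y in train:
        x = featurize(text)
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))   # predicted P(abusive)
        g = p - y                        # gradient of the loss w.r.t. z
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

def predict(text):
    z = sum(wi * xi for wi, xi in zip(w, featurize(text))) + b
    return 1 if z > 0 else 0

print(predict("you are a total idiot"))          # 1
print(predict("thanks for sharing this article")) # 0
```

Stronger submissions to the task replaced these count features with contextual embeddings from models such as mBERT, which is what won both subtasks.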
An Urdu semantic tagger - lexicons, corpora, methods and tools
Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. semantic tagger). Generally, different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These semantic annotation tools identify or tag only part of the core semantic information of language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words) is still desirable for the Urdu language, based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, in order to provide comprehensive semantic analysis of Urdu text. This research work reports on the development of an Urdu semantic tagging tool and discusses challenging issues faced in this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of newly developed tools was created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer, and a part-of-speech tagger. Results for these proposed tools are as follows: the word tokenizer reports an F-score of 94.01% and an accuracy of 97.21%, the sentence tokenizer shows an F-score of 92.59% and an accuracy of 93.15%, whereas the POS tagger shows an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (lexicons and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques.
Furthermore, all semantic lexicons have been developed using a novel combination of automatic or semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entities. A large multi-target annotated corpus is also constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; the proposed corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labELsets Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming loss of 0.06 and an accuracy of 0.94. The best lexical coverage figures of 88.59%, 99.63%, 96.71%, and 89.63% are obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.
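The multi-target classifiers above are scored with Hamming loss: the fraction of individual label slots predicted incorrectly, averaged over all label positions. A small illustrative computation follows; the label matrices are invented toy data, not the thesis corpus, where each row stands for one item's set of binary semantic-field labels.

```python
# Hamming loss for multi-target (multi-label) classification:
# fraction of label positions where prediction and truth disagree.

def hamming_loss(y_true, y_pred):
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total

# Toy example: 3 items, 4 binary label slots each (12 slots total).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
print(hamming_loss(y_true, y_pred))  # 2 wrong of 12 slots = 0.1666...
```

Unlike subset accuracy, which requires every label of an item to be correct, Hamming loss gives partial credit per label slot, which is why it is the standard headline metric for multi-target taggers.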
Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021
Automatic detection of fake news is a highly important task in the contemporary world. This study reports on the second shared task, UrduFake@FIRE2021, on fake news detection in Urdu. The goal of the shared task is to motivate the community to develop efficient methods for solving this vital problem, particularly for the Urdu language. The task is posed as a binary classification problem: label a given news article as real or fake. The organizers provide a dataset comprising news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business, split into training and testing sets. The training set contains 1,300 annotated news articles (750 real, 550 fake), while the testing set contains 300 news articles (200 real, 100 fake). 34 teams from 7 different countries (China, Egypt, Israel, India, Mexico, Pakistan, and the UAE) registered to participate in the UrduFake@FIRE2021 shared task. Of those, 18 teams submitted experimental results, and 11 submitted technical reports, substantially more than in the UrduFake shared task in 2020, when only 6 teams submitted technical reports. The submitted technical reports demonstrated data representation techniques ranging from count-based BoW features to word vector embeddings, as well as numerous machine learning algorithms ranging from traditional SVMs to various neural network architectures, including Transformers such as BERT and RoBERTa. In this year's competition, the best-performing system obtained an F1-macro score of 0.679, which is lower than the past year's best result of 0.907 F1-macro. Admittedly, while the training sets from the past and current years overlap to a large extent, the testing set provided this year is completely different.
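The ranking metric quoted above is macro-averaged F1: compute F1 separately for the "real" and "fake" classes and average them, so the minority "fake" class counts equally. A toy computation follows; the six example labels are invented for illustration and are not the actual task predictions.

```python
# Macro-averaged F1 for binary real/fake classification.

def f1_macro(y_true, y_pred, labels=("real", "fake")):
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)   # unweighted mean over classes

y_true = ["real", "real", "real", "fake", "fake", "real"]
y_pred = ["real", "real", "fake", "fake", "real", "real"]
print(round(f1_macro(y_true, y_pred), 3))  # 0.625
```

Because the test set is imbalanced (200 real vs. 100 fake), macro averaging prevents a system that always predicts "real" from scoring well, which plain accuracy would allow.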
Causative alternation licensing in Urdu: An event structure account
Given the central role of the verb in clause structure, it is vital to understand the properties of the SEMANTIC ROOT and the EVENT SCHEMA, two constituent aspects of verb meaning, in order to understand how lexical semantic categories relate to syntactic categories. The nature of the interface between these components can, in turn, reveal the overall design of language. However, the main challenge is to make precise the nature of the semantic root and event schema, and their interactive role in argument realization options. To address this challenge, empirical evidence from diverse languages is required to determine how argument realization can be universally accounted for in terms of semantic root and event schema-based lexical semantic representation. The primary purpose of this study is to explicate the roles of semantic root and event schema in Urdu change-of-state (COS) verbs’ causative alternation, formulating licensing conditions on the lexical semantics-syntax interface involved in the phenomenon. On the semantic side of the interface, the argumentation is framed within Rappaport-Hovav and Levin’s (1998a) event structure account, and on the syntactic side, the study assumes Culicover and Jackendoff’s (2005) Simpler Syntax which accounts for an alternation in terms of constraint-based interface principles.
Given that the adequacy of theory is bound up with the reliability of empirical evidence, this study is based on data from multiple sources (lexical translation, Urdu WordNet, Urdu Lughat, individual and dialogical introspection, and speaker survey), conducts extensive analysis of morphosemantic as well as morphosyntactic aspects of 112 Urdu COS verbs, and shows that the causative alternation results from an interaction of multiple licensing factors.
The study reaches the following conclusions: (a) The anticausative form of a COS verb is basic and causative forms are derived. (b) The causative derivation shows gradient and dynamic productivity, and an interaction between lexical schemas and morphological operations, marking the CAUSE relation which reflects causal responsibility between the event participants. (c) An anticausative lexicalizes both manner and result, with a [BECOME [Y ]] event structure. (d) An anticausative’s event schema and root license only the patient argument; any additional argument is licensed by the root. The cause arguments in causatives are introduced by causative operations, and are obligatorily event schema participants. The syntactic realization of semantic arguments is sensitive to the causal responsibility relation which is reflected in the predicate’s event structure through the primitive predicate CAUSE and its relation with ACT and BECOME. (e) The various aspects of Urdu COS verbs’ causative alternation lead us to the linking rules which show that the argument structure reflects the semantics it inherits from its semantic sources of roots and event schema.
Overall, the study shows that the event structure account of Urdu COS verbs' causative alternation supports the decomposition of the grammar into independent generative components that interact through interface rules. The bottom line is that such a syntax-semantics interface formulation of alternation avoids syntactic complexity.