7 research outputs found
Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages
The paper reports work on collecting and annotating code-mixed English-Hindi social media text (Twitter and Facebook messages), and experiments on automatic tagging of these corpora, using both a coarse-grained and a fine-grained part-of-speech tag set. We compare the performance of a combination of language-specific taggers to that of applying four machine learning algorithms to the task (Conditional Random Fields, Sequential Minimal Optimization, Naïve Bayes and Random Forests), using a range of different features based on word context and word-internal information.
Studying Generalisability across Abusive Language Detection Datasets
Work on Abusive Language Detection has tackled a wide range of subtasks and domains. As a result, there is a great deal of redundancy and non-generalisability between datasets. Through experiments on cross-dataset training and testing, the paper reveals that the preconceived notion of including more non-abusive samples in a dataset (to emulate reality) may have a detrimental effect on the generalisability of a model trained on that data. Hence a hierarchical annotation model is utilised here to reveal redundancies in existing datasets and to help reduce redundancy in future efforts.
NIT_Agartala_NLP_Team at SemEval-2019 Task 6: An Ensemble Approach to Identifying and Categorizing Offensive Language in Twitter Social Media Corpora
The paper describes the systems submitted to OffensEval (SemEval 2019, Task 6) on ‘Identifying and Categorizing Offensive Language in Social Media’ by the ‘NIT_Agartala_NLP_Team’. An annotated Twitter dataset of 13,240 English tweets was provided by the task organizers to train the individual models, with the best results obtained using an ensemble model composed of six different classifiers. The ensemble model produced macro-averaged F1-scores of 0.7434, 0.7078 and 0.4853 on Subtasks A, B, and C, respectively. The paper highlights the overall low predictive power of various linguistic features and surface-level count features, as well as the limitations of a traditional machine learning approach when compared to a Deep Learning counterpart.
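For readers unfamiliar with the metric quoted in the abstract above, a macro-averaged F1-score is the unweighted mean of the per-class F1-scores, so a rare class counts as much as a frequent one. The following is a minimal sketch of that calculation only; the function name `macro_f1`, the label strings and the toy data are illustrative assumptions and are not taken from the paper or from the task's official scorer.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # true class g was missed
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two classes, one error in each direction.
gold = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

In this toy case the minority class scores 0.5 and the majority class 0.75, and the macro average weights them equally.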
Sentence Boundary Detection for Social Media Text
The paper presents a study on automatic sentence boundary detection in social media texts such as Facebook messages and Twitter micro-blogs (tweets). We explore the limitations of using existing rule-based sentence boundary detection systems on social media text, and as an alternative investigate applying three machine learning algorithms (Conditional Random Fields, Naïve Bayes, and Sequential Minimal Optimization) to the task. The systems were tested on three corpora annotated with sentence boundaries, one containing more formal English text, one consisting of tweets and Facebook posts in English, and one with tweets in code-mixed English-Hindi. The results show that Naïve Bayes and Sequential Minimal Optimization were clearly more successful than the other approaches.