7 research outputs found

    Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages

    The paper reports work on collecting and annotating code-mixed English-Hindi social media text (Twitter and Facebook messages), and experiments on automatic tagging of these corpora, using both a coarse-grained and a fine-grained part-of-speech tag set. We compare the performance of a combination of language-specific taggers to that of applying four machine learning algorithms to the task (Conditional Random Fields, Sequential Minimal Optimization, Naïve Bayes and Random Forests), using a range of different features based on word context and word-internal information.

    Studying Generalisability across Abusive Language Detection Datasets

    Work on Abusive Language Detection has tackled a wide range of subtasks and domains. As a result, there is a great deal of redundancy and non-generalisability between datasets. Through experiments on cross-dataset training and testing, the paper reveals that the preconceived notion of including more non-abusive samples in a dataset (to emulate reality) may have a detrimental effect on the generalisability of a model trained on that data. Hence a hierarchical annotation model is utilised here to reveal redundancies in existing datasets and to help reduce redundancy in future efforts.

    NIT_Agartala_NLP_Team at SemEval-2019 Task 6: An Ensemble Approach to Identifying and Categorizing Offensive Language in Twitter Social Media Corpora

    The paper describes the systems submitted to OffensEval (SemEval 2019, Task 6) on ‘Identifying and Categorizing Offensive Language in Social Media’ by the ‘NIT_Agartala_NLP_Team’. An annotated Twitter dataset of 13,240 English tweets was provided by the task organizers to train the individual models, with the best results obtained using an ensemble model composed of six different classifiers. The ensemble model produced macro-averaged F1-scores of 0.7434, 0.7078 and 0.4853 on Subtasks A, B, and C, respectively. The paper highlights the overall low predictive nature of various linguistic features and surface level count features, as well as the limitations of a traditional machine learning approach when compared to a Deep Learning counterpart.

    Sentence Boundary Detection for Social Media Text

    The paper presents a study on automatic sentence boundary detection in social media texts such as Facebook messages and Twitter micro-blogs (tweets). We explore the limitations of using existing rule-based sentence boundary detection systems on social media text, and as an alternative investigate applying three machine learning algorithms (Conditional Random Fields, Naïve Bayes, and Sequential Minimal Optimization) to the task. The systems were tested on three corpora annotated with sentence boundaries, one containing more formal English text, one consisting of tweets and Facebook posts in English, and one with tweets in code-mixed English-Hindi. The results show that Naïve Bayes and Sequential Minimal Optimization were clearly more successful than the other approaches.