Identification of monolingual and code-switch information from English-Kannada code-switch data
Code-switching is very common in social media communication, particularly in multilingual countries such as India. Using more than one language within a single communication is known as code-switching or code-mixing. Important applications of code-switching include machine translation (MT), shallow parsing, dialog systems, and semantic parsing. Identifying code-switch and monolingual information is useful for better communication on online networking websites. In this paper, we apply a character-level n-gram approach to identify monolingual and code-switch information in English-Kannada social media data. We compare several machine learning techniques, namely naïve Bayes (NB), support vector classifier (SVC), logistic regression (LR), and neural network (NN), on English-Kannada code-switch (EKCS) data. We observe that the character-level n-gram approach yields an improvement of 1.8% to 4.1% in accuracy and 1.6% to 3.8% in F1-score. We also observe that SVC and NN perform best, reaching 97.9% accuracy and a 98% F1-score with character-level n-grams.
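To make the idea of character-level n-gram classification concrete, here is a minimal sketch using only the Python standard library. It is not the authors' implementation: the names `char_ngrams` and `CharNgramNaiveBayes`, the n-gram range of 1 to 3, the add-one smoothing, and the toy Romanized sentences are all assumptions made for illustration, and none of the data comes from the EKCS dataset. Of the four classifiers the abstract lists, only a simple multinomial naïve Bayes is shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (assumed range)."""
    padded = f" {text.lower()} "  # pad so word boundaries become features
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class CharNgramNaiveBayes:
    """Multinomial naive Bayes over character n-gram counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.ngram_counts[label]
            total = sum(counts.values())
            # log prior plus smoothed log likelihood of each n-gram
            score = math.log(doc_count / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, invented for illustration only.
train_texts = [
    "the movie was very good",
    "i am going to the office",
    "movie tumba chennagide",
    "nanu office ge hogtini",
]
train_labels = ["monolingual", "monolingual", "code-switch", "code-switch"]

model = CharNgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("song tumba chennagide"))
```

Character n-grams are attractive for Romanized social media text because spelling varies widely; sub-word fragments can still overlap when whole tokens do not. With four training sentences the prediction carries no evidential weight; the sketch only shows the mechanics.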
Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
This paper presents a novel scheme for the annotation of hate speech in
corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical
analysis of posts made in reaction to news reports on the Mediterranean
migration crisis and LGBTIQ+ matters in Malta, which was conducted under the
auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realization that
hate speech is not a clear-cut category to begin with, appears to belong to a
continuum of discriminatory discourse and is often realized through the use of
indirect linguistic means, it is argued that annotation schemes for its
detection should refrain from directly including the label 'hate speech,' as
different annotators might have different thresholds as to what constitutes
hate speech and what not. In view of this, we suggest a multi-layer annotation
scheme, which is pilot-tested against a binary +/- hate speech classification
and appears to yield higher inter-annotator agreement. Motivating the
postulation of our scheme, we then present the MaNeCo corpus on which it will
eventually be used: a substantial corpus of on-line newspaper comments spanning
10 years.
Comment: 10 pages, 1 table. Appears in Proceedings of the 12th edition of the
Language Resources and Evaluation Conference (LREC'20
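The abstract above does not say which statistic was used to measure inter-annotator agreement. As one common choice, the sketch below computes Cohen's kappa for two annotators; the function name `cohen_kappa` and the label sequences are hypothetical and are not drawn from the MaNeCo pilot.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical binary +/- hate speech judgements from two annotators.
annotator_1 = ["+", "+", "-", "-", "+", "-"]
annotator_2 = ["+", "-", "-", "-", "+", "+"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example, "not hate speech") dominates the data.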
Applying corpus and computational methods to loanword research: new approaches to Anglicisms in Spanish
Understanding both the linguistic and social roles of loanwords is becoming more relevant as globalization has brought loanwords into new settings, often previously viewed as monolingual. Their occurrence has the potential to impact speech communities, in that they have the capacity to alter the semantic relationships and social values ascribed to individual elements within the existing lexicon. In order to identify broad patterns, we must turn towards large and varied sources of data, specifically corpora. This dissertation aims to tackle some of the practical issues involved in the use of corpora, while addressing two conceptual issues in the field of loanword research – the social distribution and semantic nature of loanwords. In this dissertation, I propose two methods, adapted from advances in computational linguistics, which will contribute to two different stages of loanword research: processing corpora to find tokens of interest and semantically analyzing tokens of interest. These methods will be employed in two case studies. The first seeks to explore the social stratification of loanwords in Argentine Spanish. The second measures the semantic specificity of loanwords relative to their native equivalents.
Spanish and Portugues