LOD-Connected Offensive Language Ontology and Tagset Enrichment

Abstract

CC BY 4.0

This paper revises the definitions and enriches the typology of offensive language, drawing on publicly available offensive language datasets and testing them on available pretrained lexical embedding systems. We review over 60 available corpora and compare the tagging schemas applied in them, attempting to explain the semantic differences between particular concepts of the category OFFENSIVE in English. We present a finite set of classes that cover aspects of offensive language representation, together with linguistically sound explanations. The set is based on the offensive language categorization schemata originally proposed by Zampieri et al. [1, 2] and was tested with Sketch Engine tools on a large web-based corpus. The schemata are juxtaposed and discussed with reference to the non-contextual word embeddings FastText, Word2Vec, and GloVe. We also provide a methodology for mapping existing corpora to the unified ontology presented in this paper. The proposed schema will enable further comparable research and effective use of corpora of languages other than English. It will also be applied in building an enriched tagset to be trained and used on new data, with the application of recently developed LLOD techniques [3].
