3 research outputs found

    A study of text representations in Hate Speech Detection

    The pervasiveness of the Internet and social media has enabled the rapid and anonymous spread of Hate Speech content on microblogging platforms such as Twitter. Current EU and US legislation against hateful language, in conjunction with the large amount of data produced on these platforms, has led to automatic tools being a necessary component of the Hate Speech detection task and pipeline. In this study, we examine the performance of several diverse text representation techniques paired with multiple classification algorithms on the automatic Hate Speech detection and abusive language discrimination task. We perform an experimental evaluation on binary and multiclass datasets, paired with significance testing. Our results show that simple hate-keyword frequency features (BoW) work best, followed by pre-trained word embeddings (GloVe) as well as N-gram graphs (NGGs), a graph-based representation that proved to produce efficient, very low-dimensional but rich features for this task. A combination of these representations paired with Logistic Regression or 3-layer neural network classifiers achieved the best detection performance in terms of micro and macro F-measure.
    Comment: 14 pages, CICLing201

    MUDOS-NG: Multi-document Summaries Using N-gram Graphs (Tech Report)

    This report describes the MUDOS-NG summarization system, which applies a set of language-independent and generic methods for generating extractive summaries. The proposed methods are mostly combinations of simple operators on a generic character n-gram graph representation of texts. This work defines the set of operators used on n-gram graphs and proposes applying them within the multi-document summarization process in subtasks such as document analysis, salient sentence selection, query expansion and redundancy control. Furthermore, a novel chunking methodology is used, together with a novel way to assign concepts to sentences for query expansion. The experimental results of the summarization system, obtained on widely used corpora from the Document Understanding and the Text Analysis Conferences, are promising and provide evidence for the potential of the generic methods introduced. This work aims to designate core methods exploiting the n-gram graph representation, providing the basis for more advanced summarization systems.
    Comment: Technical Repor
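
The character n-gram graph representation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the MUDOS-NG implementation: the function names `ngram_graph` and `value_similarity`, the default n-gram size and window, and the particular similarity formula are all assumptions chosen for illustration. The idea shown is that each character n-gram becomes a node, n-grams occurring close together are linked by a weighted edge, and two texts can be compared through the edges their graphs share.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph (hypothetical helper, not from the report).

    The graph is stored as a Counter that maps an undirected edge, written as
    a sorted pair of n-grams, to the number of times the pair co-occurs.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Link each n-gram to the n-grams starting at the next `window` positions.
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((gram, neighbour)))] += 1
    return edges


def value_similarity(g1, g2):
    """Compare two graphs by their shared edges (assumed formula).

    Each shared edge contributes the ratio of its smaller to its larger
    weight; the sum is normalised by the size of the larger graph.
    """
    if not g1 or not g2:
        return 0.0
    shared = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in shared)
    return total / max(len(g1), len(g2))
```

Because the graph is built from raw characters, the sketch needs no tokenizer or language-specific resources, which is consistent with the language-independent framing of the abstract.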

    Testing the use of n-gram graphs in summarization sub-tasks

    Within this article, we sketch the set of generic tools we have devised and used within the summarization process and the domain of summary evaluation, focusing on how the tools were used within the TAC 2008 summarization update challenge. The tools have a common underlying theory and provide utility in various aspects of the Natural Language Processing domain. Within this study we elaborate on query expansion, content matching and filtering, redundancy removal, and summary evaluation.