3 research outputs found
A study of text representations in Hate Speech Detection
The pervasiveness of the Internet and social media has enabled the rapid and
anonymous spread of Hate Speech content on microblogging platforms such as
Twitter. Current EU and US legislation against hateful language, in conjunction
with the large amount of data produced on these platforms, has made automatic
tools a necessary component of the Hate Speech detection task and
pipeline. In this study, we examine the performance of several diverse text
representation techniques paired with multiple classification algorithms, on
the automatic Hate Speech detection and abusive language discrimination task.
We perform an experimental evaluation on binary and multiclass datasets, paired
with significance testing. Our results show that simple hate-keyword frequency
features (BoW) work best, followed by pre-trained word embeddings (GloVe) as
well as N-gram graphs (NGGs): a graph-based representation which proved to
produce efficient, very low-dimensional but rich features for this task. A
combination of these representations paired with Logistic Regression or 3-layer
neural network classifiers achieved the best detection performance, in terms of
micro and macro F-measure.
Comment: 14 pages, CICLing201
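The abstract reports both micro and macro F-measure. As a rough illustration of why the two can differ on imbalanced hate speech data, the following minimal sketch computes both from scratch. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Macro: average the per-class F1 scores, so each class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so each instance counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy, imbalanced example (hypothetical labels): one of two "hate" items is missed.
y_true = ["none"] * 8 + ["hate"] * 2
y_pred = ["none"] * 8 + ["none", "hate"]
micro, macro = micro_macro_f1(y_true, y_pred)
```

In this toy case the micro score is dominated by the majority class, while the macro score is pulled down by the weaker minority class, which is one reason to report both.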
MUDOS-NG: Multi-document Summaries Using N-gram Graphs (Tech Report)
This report describes the MUDOS-NG summarization system, which applies a set
of language-independent and generic methods for generating extractive
summaries. The proposed methods are mostly combinations of simple operators on
a generic character n-gram graph representation of texts. This work defines the
set of operators used on n-gram graphs and proposes using these operators
within the multi-document summarization process in such subtasks as document
analysis, salient sentence selection, query expansion and redundancy control.
Furthermore, a novel chunking methodology is used, together with a novel way to
assign concepts to sentences for query expansion. The experimental results of
the summarization system, obtained on widely used corpora from the Document
Understanding and the Text Analysis Conferences, are promising and provide
evidence for the potential of the generic methods introduced. This work aims to
designate core methods exploiting the n-gram graph representation, providing
the basis for more advanced summarization systems.
Comment: Technical Repor
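To make the idea of a character n-gram graph and an operator on it more concrete, here is a minimal sketch. It assumes that nodes are character n-grams, that a directed edge links each n-gram to the n-grams that follow it within a fixed window, and that edge weights count co-occurrences. The function names, the default parameters, and the edge-overlap score are illustrative assumptions, not the report's exact definitions.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of weighted edges.

    Assumption: an edge (g, h) is added whenever n-gram h starts within
    `window` positions after n-gram g; its weight is the number of times
    that happens.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for h in grams[i + 1:i + 1 + window]:
            edges[(g, h)] += 1
    return edges


def edge_overlap(g1, g2):
    """Share of edges common to both graphs, relative to the smaller graph.

    An illustrative similarity operator; edge weights are ignored here.
    """
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


# Hypothetical usage: score a candidate sentence against a query.
query_graph = ngram_graph("n-gram graph summarization")
sentence_graph = ngram_graph("summarization with n-gram graphs")
score = edge_overlap(query_graph, sentence_graph)
```

Because the graph is built from characters rather than words, the same code applies to any language without tokenization, which matches the language-independent aim described in the report.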
Testing the use of n-gram graphs in summarization sub-tasks
Within this article, we sketch the set of generic tools we have devised and used within the summarization process and the domain of summary evaluation, focusing on how the tools were used within the TAC 2008 summarization update challenge. The tools have a common underlying theory and provide utility in various aspects of the Natural Language Processing domain. Within this study we elaborate on query expansion, content matching and filtering, redundancy removal, as well as summary evaluation.