Automatic text categorisation of racist webpages
Automatic Text Categorisation (TC) involves the assignment of one or more predefined categories to text documents in order that they can be effectively managed. In this thesis we examine the possibility of applying automatic text categorisation to the problem of categorising texts (web pages) based on whether or not they are racist.
TC has proven successful for topic-based problems such as news story categorisation. However, detecting racism differs from topic-based problems in that lexical items present in racist documents can also appear in anti-racist documents, or indeed in potentially any document. The mere presence of a potentially racist term does not necessarily mean the document is racist. The difficulty lies in finding what distinguishes racist documents from non-racist ones.
We use a machine learning method called Support Vector Machines (SVM) to automatically learn features of racism in order to be capable of making a decision about the target class of unseen documents. We examine various representations within an SVM so as to identify the most effective method for handling this problem. Our work shows that it is possible to develop automatic categorisation of web pages, based on these approache
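The learning step described above can be illustrated with a minimal sketch of a linear SVM trained by batch subgradient descent on the regularised hinge loss. This is not the thesis's implementation: the function names, the hyperparameters (`lam`, `lr`, `epochs`) and the two-feature toy data are all assumptions chosen for illustration, and the feature values stand in for counts of two hypothetical terms.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by batch subgradient descent on the L2-regularised hinge loss.

    X is a list of feature vectors, y a list of labels in {+1, -1}.
    All hyperparameter defaults are illustrative assumptions.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        # Gradient of the regulariser (lam / 2) * ||w||^2.
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Only examples inside the margin contribute to the hinge subgradient.
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical toy data: counts of two made-up terms per document.
X_toy = [[2, 0], [3, 1], [0, 2], [1, 3]]
y_toy = [1, 1, -1, -1]

w, b = train_linear_svm(X_toy, y_toy)
print([predict(w, b, x) for x in X_toy])
```

In practice a library solver would normally replace the hand-written loop; the sketch only shows how a separating hyperplane is learned from labelled feature vectors and then applied to unseen ones.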
Second-generation p-values: improved rigor, reproducibility, & transparency in statistical analyses
Verifying that a statistically significant result is scientifically
meaningful is not only good scientific practice, it is a natural way to control
the Type I error rate. Here we introduce a novel extension of the p-value - a
second-generation p-value - that formally accounts for scientific relevance and
leverages this natural Type I Error control. The approach relies on a
pre-specified interval null hypothesis that represents the collection of effect
sizes that are scientifically uninteresting or are practically null. The
second-generation p-value is the proportion of data-supported hypotheses that
are also null hypotheses. As such, second-generation p-values indicate when the
data are compatible with null hypotheses, or with alternative hypotheses, or
when the data are inconclusive. Moreover, second-generation p-values provide a
proper scientific adjustment for multiple comparisons and reduce false
discovery rates. This is an advance for environments rich in data, where
traditional p-value adjustments are needlessly punitive. Second-generation
p-values promote transparency, rigor and reproducibility of scientific results
by a priori specifying which candidate hypotheses are practically meaningful
and by providing a more reliable statistical summary of when the data are
compatible with alternative or null hypotheses.
Comment: 29 pages, 29 page Supplemen
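The definition in the abstract above (the proportion of data-supported hypotheses that are also null hypotheses) can be sketched as an interval-overlap calculation. This sketch assumes the data-supported hypotheses are summarised by an interval estimate such as a confidence interval, which the abstract does not state. The correction factor for very wide interval estimates is also an assumption about the full definition rather than something given in the abstract. The function and parameter names are hypothetical.

```python
def second_generation_p_value(est_lo, est_hi, null_lo, null_hi):
    """Fraction of the interval estimate that lies inside the interval null.

    [est_lo, est_hi]  : interval of data-supported effect sizes (assumed, e.g. a CI)
    [null_lo, null_hi]: pre-specified interval of practically null effect sizes
    """
    est_len = est_hi - est_lo
    null_len = null_hi - null_lo
    if est_len <= 0 or null_len <= 0:
        raise ValueError("both intervals must have positive length")

    # Length of the intersection of the two intervals (zero if they are disjoint).
    overlap = max(0.0, min(est_hi, null_hi) - max(est_lo, null_lo))
    proportion = overlap / est_len

    # Assumed correction: when the interval estimate is more than twice as wide
    # as the null interval, the value is scaled so that it cannot exceed 1/2.
    correction = max(est_len / (2.0 * null_len), 1.0)
    return proportion * correction


# Interval estimate entirely outside the null interval.
print(second_generation_p_value(0.5, 1.5, -0.2, 0.2))
# Interval estimate entirely inside the null interval.
print(second_generation_p_value(-0.1, 0.1, -0.2, 0.2))
# Interval estimate straddling the null boundary.
print(second_generation_p_value(0.0, 0.4, -0.2, 0.2))
```

Read against the abstract, a value of 0 indicates data compatible only with alternative hypotheses, a value of 1 indicates data compatible only with null hypotheses, and values in between indicate inconclusive data.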
Classifying racist texts using a support vector machine
In this poster we present an overview of the techniques we used to develop and evaluate a text categorisation system that automatically classifies racist texts. Detecting racism is difficult because, unlike in some other text classification tasks, the presence of indicator words alone is not enough to identify a racist text. Support Vector Machines (SVM) are used to automatically categorise web pages based on whether or not they are racist. Different interpretations of what constitutes a term are taken, and in this poster we look at three representations of a web page within an SVM: bag-of-words, bigrams and part-of-speech tags
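The three representations named above can be sketched as simple count-based feature extractors. This is an illustration only, not the authors' pipeline: the whitespace tokeniser, the function names and the tiny lookup table standing in for a part-of-speech tagger are all assumptions, and a real system would use a trained tagger.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokeniser)."""
    return text.lower().split()


def bag_of_words(tokens):
    """Count each individual token, ignoring word order."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Count each pair of adjacent tokens, preserving local word order."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical stand-in for a real part-of-speech tagger.
TOY_TAGS = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}


def pos_tag_counts(tokens, tags=TOY_TAGS):
    """Count part-of-speech tags; tokens missing from the table get the tag 'X'."""
    return Counter(tags.get(tok, "X") for tok in tokens)


tokens = tokenize("The cat sat on the mat")
print(bag_of_words(tokens))
print(bigram_counts(tokens))
print(pos_tag_counts(tokens))
```

Each counter can be mapped to a fixed-length vector over a vocabulary before being passed to a classifier. Bigrams keep some local context that bag-of-words discards, which matters when the same term appears in both classes of document.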