34,138 research outputs found
Functional Text Dimensions for the annotation of web corpora
This paper presents an approach to classifying large web corpora into genres by means of Functional Text Dimensions (FTDs). This offers a topological approach to text typology in which texts are described in terms of their similarity to prototype genres. The suggested set of categories is designed to be applicable to any text on the web and to be reliable in annotation practice. Interannotator agreement results show that the suggested categories yield Krippendorff's α above 0.76. In addition to the functional space of eighteen dimensions, similarity between annotated documents can be described visually within a space of reduced dimensions obtained through t-distributed Stochastic Neighbour Embedding (t-SNE). Reliably annotated texts also provide the basis for automatic genre classification, which can be done in each FTD as well as within the space of reduced dimensions. An example comparing texts from the Brown Corpus, the BNC and ukWaC, a large web corpus, is provided.
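The agreement figure quoted in this abstract is a Krippendorff's α. As a rough illustration of how such a coefficient can be computed, the sketch below implements α with the interval (squared-difference) distance. The abstract does not say which distance metric or tooling the authors used, so the metric, the function name `krippendorff_alpha_interval`, and the toy ratings are all assumptions made for illustration.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` is a list of lists: one inner list of numeric ratings per
    document. The choice of metric is an assumption; the abstract does
    not state which one was used.
    """
    # Only documents rated by at least two annotators are pairable.
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: ordered pairs of ratings within each document.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: ordered pairs over all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("ratings show no variation; alpha is undefined")

    return 1 - d_o / d_e


# Hypothetical ratings of three documents by two annotators on one dimension.
perfect = krippendorff_alpha_interval([[0, 0], [2, 2], [1, 1]])
```

With identical ratings per document the observed disagreement is zero, so α is 1; systematic disagreement pushes α below zero.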
Topology comparison of Twitter diffusion networks effectively reveals misleading information
In recent years, malicious information has seen explosive growth in social
media, with serious social and political backlashes. Recent important studies,
featuring large-scale analyses, have produced deeper knowledge about this
phenomenon, showing that misleading information spreads faster, deeper and more
broadly than factual information on social media, where echo chambers,
algorithmic and human biases play an important role in diffusion networks.
Following these directions, we explore the possibility of classifying news
articles circulating on social media based exclusively on a topological
analysis of their diffusion networks. To this aim we collected a large dataset
of diffusion networks on Twitter pertaining to news articles published on two
distinct classes of sources, namely outlets that convey mainstream, reliable
and objective information and those that fabricate and disseminate various
kinds of misleading articles, including false news intended to harm, satire
intended to make people laugh, click-bait news that may be entirely factual or
rumors that are unproven. We carried out an extensive comparison of these
networks using several alignment-free approaches including basic network
properties, centrality measure distributions, and network distances. We
accordingly evaluated to what extent these techniques make it possible to
discriminate between the networks associated with the aforementioned news domains. Our results
highlight that the communities of users spreading mainstream news, compared to
those sharing misleading news, tend to shape diffusion networks with subtle yet
systematic differences which might be effectively employed to identify
misleading and harmful information.
Comment: A revised version is available in Scientific Reports.
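To make the idea of "basic network properties" concrete, here is a minimal sketch that derives a few topological features from a directed edge list standing in for a diffusion network. The particular features, the function name `basic_network_features`, and the toy edges are assumptions; the abstract does not list the exact properties the authors computed.

```python
from collections import defaultdict, deque


def basic_network_features(edges):
    """Return a few simple topological features of a directed network.

    `edges` is an iterable of (source, target) pairs, for example
    "user A was retweeted by user B". The feature set is illustrative only.
    """
    # Drop duplicate edges and self-loops so that density stays in [0, 1].
    edge_set = {(s, t) for s, t in edges if s != t}

    nodes = set()
    neighbours = defaultdict(set)  # undirected view, used for weak connectivity
    out_degree = defaultdict(int)
    for src, dst in edge_set:
        nodes.update((src, dst))
        neighbours[src].add(dst)
        neighbours[dst].add(src)
        out_degree[src] += 1

    # Size of the largest weakly connected component, found by BFS.
    seen = set()
    largest = 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        largest = max(largest, size)

    n, m = len(nodes), len(edge_set)
    return {
        "nodes": n,
        "edges": m,
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "largest_wcc": largest,
        "max_out_degree": max(out_degree.values(), default=0),
    }


# Hypothetical toy network with two separate cascades.
features = basic_network_features([("a", "b"), ("a", "c"), ("d", "e")])
```

Feature dictionaries of this kind, computed per network, could then be fed to any standard classifier to compare the two classes of sources.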
Graph-based Features for Automatic Online Abuse Detection
While online communities have become increasingly important over the years,
the moderation of user-generated content is still performed mostly manually.
Automating this task is an important step in reducing the financial cost
associated with moderation, but the majority of automated approaches strictly
based on message content are highly vulnerable to intentional obfuscation. In
this paper, we discuss methods for extracting conversational networks based on
raw multi-participant chat logs, and we study the contribution of graph
features to a classification system that aims to determine if a given message
is abusive. The conversational graph-based system yields unexpectedly high
performance, with results comparable to those previously obtained with a
content-based approach.
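One simple way to turn a multi-participant chat log into a conversational network is sketched below: each message links its author to the authors of the few messages preceding it, with closer messages weighted more heavily. This windowing scheme, the `1 / distance` weighting, and the names `build_conversation_graph` and `node_strength` are assumptions for illustration, not the extraction method described in the paper.

```python
from collections import defaultdict


def build_conversation_graph(messages, window=3):
    """Build a weighted undirected graph from a chronological chat log.

    `messages` is a list of (author, text) pairs. Each message connects its
    author to the authors of up to `window` preceding messages, with weight
    1 / distance. This scheme is a hypothetical simplification.
    """
    weights = defaultdict(float)
    for i, (author, _text) in enumerate(messages):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            other = messages[j][0]
            if other != author:  # ignore a user replying to themselves
                weights[frozenset((author, other))] += 1.0 / distance
    return dict(weights)


def node_strength(graph, user):
    """Sum of edge weights incident to `user`: one simple graph feature."""
    return sum(w for pair, w in graph.items() if user in pair)


# Hypothetical four-message log.
log = [("ann", "hi"), ("bob", "hello"), ("ann", "how are you"), ("cat", "hey")]
graph = build_conversation_graph(log, window=2)
```

Features such as `node_strength(graph, "ann")`, computed for the author of the message under review, could serve as inputs to a classifier alongside or instead of content-based features.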