Neural Based Statement Classification for Biased Language
Biased language commonly occurs around controversial topics, stirring
disagreement between the parties involved in a discussion. This is because
the understanding and use of language, and of particular phrases especially,
is cohesive within a given group. However, such cohesiveness does not hold
across groups.
In collaborative environments or environments where impartial language is
desired (e.g. Wikipedia, news media), statements and the language therein
should represent the involved parties equally and be neutrally phrased. Biased
language is introduced through the presence of inflammatory words or phrases,
or statements that may be incorrect or one-sided, thus violating such
consensus.
In this work, we focus on the specific case of phrasing bias, which may be
introduced through specific inflammatory words or phrases in a statement. For
this purpose, we propose an approach that relies on recurrent neural networks
in order to capture the inter-dependencies between words in a phrase that
introduce bias.
We perform a thorough experimental evaluation, where we show the advantages
of a neural based approach over competitors that rely on word lexicons and
other hand-crafted features in detecting biased language. We are able to
distinguish biased statements with a precision of P=0.92, thus significantly
outperforming baseline models with an improvement of over 30%. Finally, we
release the largest corpus of statements annotated for biased language.
Comment: The Twelfth ACM International Conference on Web Search and Data
Mining, February 11-15, 2019, Melbourne, VIC, Australia
Methods for detecting and mitigating linguistic bias in text corpora
As the web continues to spread into every aspect of daily life, bias in the form of
prejudice and hidden opinions is becoming an increasingly challenging problem. One
widespread manifestation is bias in text data. To counter this, the online encyclopedia
Wikipedia introduced the Neutral Point of View (NPOV) principle, which prescribes the
use of neutral language and the avoidance of one-sided or subjective phrasing. While
studies have shown that the quality of Wikipedia articles is comparable to that of
articles in traditional encyclopedias, research also shows that Wikipedia is prone to
various types of NPOV violations. Identifying bias can be a challenging task, even for
humans, and with millions of articles and a declining number of contributors, the task
is becoming increasingly difficult. If bias is not contained, it can not only lead to
polarization and conflict between opinion groups but also negatively affect users'
ability to form their own opinions freely. In addition, bias in texts and in
ground-truth data can negatively affect machine learning models trained on that data,
which can lead to discriminatory model behavior.
In this work, we address bias by focusing on three central aspects: biased content in
the form of written statements, bias of crowdworkers during data annotation, and bias
in word embedding representations.
We present two approaches for identifying biased statements in text collections such
as Wikipedia. Our feature-based approach uses bag-of-words features, including a list
of bias words that we compiled by identifying clusters of bias words in the vector
space of word embeddings. Our improved, neural approach uses gated recurrent neural
networks to capture context dependencies and further improve model performance.
Our study of crowd worker bias uncovers biased behavior of crowdworkers with extreme
opinions on a given topic and shows that this behavior influences the resulting
ground-truth labels, which in turn affects the creation of datasets for tasks such as
bias identification or sentiment analysis. We present approaches for mitigating worker
bias that raise awareness among workers and use the concept of social projection.
Finally, we address the problem of bias in word embeddings by focusing on the example
of varying sentiment scores for names. We show that bias in the training data is
captured by the embeddings and passed on to downstream models. In this context, we
present a debiasing approach that reduces the bias effect and has a positive impact on
the labels produced by a downstream sentiment classifier.
Automatic Detection of Online Jihadist Hate Speech
We have developed a system that automatically detects online jihadist hate
speech with over 80% accuracy, by using techniques from Natural Language
Processing and Machine Learning. The system is trained on a corpus of 45,000
subversive Twitter messages collected from October 2014 to December 2016. We
present a qualitative and quantitative analysis of the jihadist rhetoric in the
corpus, examine the network of Twitter users, outline the technical procedure
used to train the system, and discuss examples of use.
Comment: 31 pages
Uncertainty Detection as Approximate Max-Margin Sequence Labelling
This paper reports experiments for the CoNLL 2010 shared task on learning to detect hedges and their scope in natural language text. We have addressed the experimental tasks as supervised linear maximum margin prediction problems. For sentence level hedge detection in the biological domain we use an L1-regularised binary support vector machine, while for sentence level weasel detection in the Wikipedia domain, we use an L2-regularised approach. We model the in-sentence uncertainty cue and scope detection task as an L2-regularised approximate maximum margin sequence labelling problem, using the BIO-encoding. In addition to surface level features, we use a variety of linguistic features based on a functional dependency analysis. A greedy forward selection strategy is used in exploring the large set of potential features.
Our official results for Task 1 are an F1-score of 85.2 for the biological domain and 55.4 for the Wikipedia set. For Task 2, our official results are 2.1 for the entire task, with a score of 62.5 for cue detection. After resolving errors and final bugs, our final results are, for Task 1, biological: 86.0 and Wikipedia: 58.2; for Task 2, scopes: 39.6 and cues: 78.5.
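The BIO-encoding mentioned in this abstract can be illustrated with a minimal sketch. The function name `bio_encode`, the label `CUE`, and the example sentence are assumptions made for illustration only; they are not taken from the paper, and the sketch covers just the tagging scheme, not the max-margin learner.

```python
from typing import List, Tuple


def bio_encode(tokens: List[str],
               spans: List[Tuple[int, int]],
               label: str = "CUE") -> List[str]:
    """Tag each token as B-<label>, I-<label>, or O.

    `spans` holds (start, end) token indices with `end` exclusive.
    The label name is a hypothetical choice for this sketch.
    """
    tags = ["O"] * len(tokens)          # everything is outside by default
    for start, end in spans:
        tags[start] = f"B-{label}"      # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens inside the span
    return tags


# Hypothetical example: "may indicate" marked as a two-token uncertainty cue.
tokens = ["These", "results", "may", "indicate", "that", "X", "binds", "Y"]
tags = bio_encode(tokens, [(2, 4)])
```

A sequence labeller is then trained to predict one such tag per token, which turns span detection into a per-token classification problem.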
NELA-GT-2018: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles
In this paper, we present a dataset of 713k articles collected between
02/2018 and 11/2018. These articles are collected directly from 194 news and media
outlets including mainstream, hyper-partisan, and conspiracy sources. We
incorporate ground truth ratings of the sources from 8 different assessment
sites covering multiple dimensions of veracity, including reliability, bias,
transparency, adherence to journalistic standards, and consumer trust. The
NELA-GT-2018 dataset can be found at https://doi.org/10.7910/DVN/ULHLCB.
Comment: Published at ICWSM 201
Pushing Your Point of View: Behavioral Measures of Manipulation in Wikipedia
As a major source for information on virtually any topic, Wikipedia serves an
important role in public dissemination and consumption of knowledge. As a
result, it presents tremendous potential for people to promulgate their own
points of view; such efforts may be more subtle than typical vandalism. In this
paper, we introduce new behavioral metrics to quantify the level of controversy
associated with a particular user: a Controversy Score (C-Score) based on the
amount of attention the user focuses on controversial pages, and a Clustered
Controversy Score (CC-Score) that also takes into account topical clustering.
We show that both these measures are useful for identifying people who try to
"push" their points of view, by showing that they are good predictors of which
editors get blocked. The metrics can be used to triage potential POV pushers.
We apply this idea to a dataset of users who requested promotion to
administrator status and easily identify some editors who significantly changed
their behavior upon becoming administrators. At the same time, such behavior is
not rampant. Those who are promoted to administrator status tend to have more
stable behavior than comparable groups of prolific editors. This suggests that
the Adminship process works well, and that the Wikipedia community is not
overwhelmed by users who become administrators to promote their own points of
view.
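The abstract describes the C-Score only as being based on the attention a user focuses on controversial pages; it does not give the formula. The sketch below is therefore an illustrative stand-in, not the paper's definition: it assumes each page has a controversy weight between 0 and 1 and returns the edit-weighted average for one user. The name `controversy_share` and the input format are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def controversy_share(edited_pages: Iterable[str],
                      page_controversy: Dict[str, float]) -> float:
    """Edit-weighted average controversy of the pages a user edited.

    Assumption for this sketch: `page_controversy` maps a page title to a
    weight in [0, 1]; pages missing from the map count as 0.
    """
    edits = Counter(edited_pages)       # number of edits per page
    total = sum(edits.values())
    if total == 0:
        return 0.0                      # user with no edits
    weighted = sum(n * page_controversy.get(page, 0.0)
                   for page, n in edits.items())
    return weighted / total


# Hypothetical user with four edits, two of them on a highly controversial page.
score = controversy_share(["A", "A", "B", "C"], {"A": 1.0, "B": 0.5})
```

A score near 1 would indicate that nearly all of a user's edits go to controversial pages, which is the kind of signal the abstract proposes for triaging potential POV pushers.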
False News On Social Media: A Data-Driven Survey
In the past few years, the research community has dedicated growing interest
to the issue of false news circulating on social networks. The widespread
attention on detecting and characterizing false news has been motivated by
the considerable real-world repercussions of this threat. As a matter of
fact, social media platforms exhibit peculiar characteristics, with respect to
traditional news outlets, which have been particularly favorable to the
proliferation of deceptive information. They also present unique challenges for
all kinds of potential interventions on the subject. As this issue becomes of
global concern, it is also gaining more attention in academia. The aim of this
survey is to offer a comprehensive study on the recent advances in terms of
detection, characterization and mitigation of false news that propagate on
social media, as well as the challenges and the open questions that await
future research in the field. We use a data-driven approach, focusing on a
classification of the features that are used in each study to characterize
false information and on the datasets used for training classification
methods. At the end of the survey, we highlight emerging approaches that look
most promising for addressing false news.
Exploiting Social Network Structure for Person-to-Person Sentiment Analysis
Person-to-person evaluations are prevalent in all kinds of discourse and
important for establishing reputations, building social bonds, and shaping
public opinion. Such evaluations can be analyzed separately using signed social
networks and textual sentiment analysis, but this misses the rich interactions
between language and social context. To capture such interactions, we develop a
model that predicts individual A's opinion of individual B by synthesizing
information from the signed social network in which A and B are embedded with
sentiment analysis of the evaluative texts relating A to B. We prove that this
problem is NP-hard but can be relaxed to an efficiently solvable hinge-loss
Markov random field, and we show that this implementation outperforms text-only
and network-only versions in two very different datasets involving
community-level decision-making: the Wikipedia Requests for Adminship corpus
and the Convote U.S. Congressional speech corpus.