How Many Topics? Stability Analysis for Topic Models
Topic modeling refers to the task of discovering the underlying thematic
structure in a text corpus, where the output is commonly presented as a report
of the top terms appearing in each topic. Despite the diversity of topic
modeling algorithms that have been proposed, a common challenge in successfully
applying these techniques is the selection of an appropriate number of topics
for a given corpus. Choosing too few topics will produce results that are
overly broad, while choosing too many will result in the "over-clustering" of a
corpus into many small, highly-similar topics. In this paper, we propose a
term-centric stability analysis strategy to address this issue, the idea being
that a model with an appropriate number of topics will be more robust to
perturbations in the data. Using a topic modeling approach based on matrix
factorization, evaluations performed on a range of corpora show that this
strategy can successfully guide the model selection process.
Comment: Improve readability of plots. Add minor clarification.
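The stability idea in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes each topic is given as a list of its top terms, and it uses a plain best-match Jaccard overlap as a simplified stand-in for whatever term-centric agreement measure the paper defines. The function names `jaccard`, `agreement` and `stability` are hypothetical.

```python
from typing import List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard overlap between two top-term lists, ignoring term order."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def agreement(reference: List[List[str]], other: List[List[str]]) -> float:
    """Mean, over reference topics, of the best overlap with any topic in `other`.

    Assumption: greedy best-match, so two reference topics may match the
    same topic in `other`. A one-to-one matching would be stricter.
    """
    if not reference or not other:
        return 0.0
    best = [max(jaccard(t, u) for u in other) for t in reference]
    return sum(best) / len(reference)


def stability(reference: List[List[str]], perturbed_runs: List[List[List[str]]]) -> float:
    """Average agreement between a reference model and models fit on perturbed data."""
    if not perturbed_runs:
        return 0.0
    return sum(agreement(reference, run) for run in perturbed_runs) / len(perturbed_runs)
```

Under these assumptions, one would compute `stability` for each candidate number of topics and prefer values where the score is high, since the top terms then change little when the data are perturbed.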
Pushing Your Point of View: Behavioral Measures of Manipulation in Wikipedia
As a major source for information on virtually any topic, Wikipedia serves an
important role in public dissemination and consumption of knowledge. As a
result, it presents tremendous potential for people to promulgate their own
points of view; such efforts may be more subtle than typical vandalism. In this
paper, we introduce new behavioral metrics to quantify the level of controversy
associated with a particular user: a Controversy Score (C-Score) based on the
amount of attention the user focuses on controversial pages, and a Clustered
Controversy Score (CC-Score) that also takes into account topical clustering.
We show that both these measures are useful for identifying people who try to
"push" their points of view, by showing that they are good predictors of which
editors get blocked. The metrics can be used to triage potential POV pushers.
We apply this idea to a dataset of users who requested promotion to
administrator status and easily identify some editors who significantly changed
their behavior upon becoming administrators. At the same time, such behavior is
not rampant. Those who are promoted to administrator status tend to have more
stable behavior than comparable groups of prolific editors. This suggests that
the Adminship process works well, and that the Wikipedia community is not
overwhelmed by users who become administrators to promote their own points of
view.
The most controversial topics in Wikipedia: A multilingual and geographical analysis
We present, visualize and analyse the similarities and differences between
the controversial topics related to "edit wars" identified in 10 different
language versions of Wikipedia. After a brief review of the related work we
describe the methods developed to locate, measure, and categorize the
controversial topics in the different languages. Visualizations of the degree
of overlap between the top 100 lists of most controversial articles in
different languages and the content related to geographical locations will be
presented. We discuss what the presented analysis and visualizations can tell
us about the multicultural aspects of Wikipedia and practices of
peer-production. Our results indicate that Wikipedia is more than just an
encyclopaedia; it is also a window into convergent and divergent social-spatial
priorities, interests and preferences.
Comment: This is a draft of a book chapter to be published in 2014 by
Scarecrow Press. Please cite as: Yasseri T., Spoerri A., Graham M., and
Kertész J., The most controversial topics in Wikipedia: A multilingual and
geographical analysis. In: Fichman P., Hara N., editors, Global
Wikipedia: International and cross-cultural issues in online collaboration.
Scarecrow Press (2014).
Dynamics of conflicts in Wikipedia
In this work we study the dynamical features of editorial wars in Wikipedia
(WP). Based on our previously established algorithm, we build up samples of
controversial and peaceful articles and analyze the temporal characteristics of
the activity in these samples. On short time scales, we show that there is a
clear correspondence between conflict and burstiness of activity patterns, and
that memory effects play an important role in controversies. On long time
scales, we identify three distinct developmental patterns for the overall
behavior of the articles. We are able to distinguish cases eventually leading
to consensus from those cases where a compromise is far from achievable.
Finally, we analyze discussion networks and conclude that edit wars are mainly
fought by few editors only.
Comment: Supporting information added
Highlighting Entanglement of Cultures via Ranking of Multilingual Wikipedia Articles
How do different cultures evaluate a person? Is an important person in one
culture also important in another culture? We address these questions via
ranking of multilingual Wikipedia articles. With three ranking algorithms based
on the network structure of Wikipedia, we assign rankings to all articles in 9
multilingual editions of Wikipedia and investigate the general ranking structure of
PageRank, CheiRank and 2DRank. In particular, we focus on articles related to
persons, identify the top 30 persons for each rank among different editions and
analyze distinctions of their distributions over activity fields such as
politics, art, science, religion and sport for each edition. We find that local
heroes are dominant but global heroes also exist and create an effective
network representing the entanglement of cultures. The Google matrix analysis of
the network of cultures shows signs of the Zipf law distribution. This approach
makes it possible to examine diversity and shared characteristics of knowledge
organization between cultures. The developed computational, data-driven
approach highlights cultural interconnections in a new perspective.
Comment: Published in PLoS ONE
(http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0074554).
Supporting information is available on the same webpage.
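As a rough illustration of two of the rankings named in this abstract, the sketch below computes PageRank by power iteration and obtains CheiRank by running the same procedure on the link graph with every link reversed. It is not the authors' code: the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without outgoing links are assumptions, and the function names are hypothetical. 2DRank, which combines the two orderings, is not covered.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iters: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank on a dict mapping each page to its outgoing links."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Assumption: rank held by pages with no outgoing links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank


def reverse_links(links: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the graph with the direction of every link inverted."""
    rev: Dict[str, List[str]] = {u: [] for u in links}
    for u, targets in links.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    return rev


def cheirank(links: Dict[str, List[str]], damping: float = 0.85,
             iters: int = 100) -> Dict[str, float]:
    """CheiRank: PageRank computed on the link-reversed graph."""
    return pagerank(reverse_links(links), damping, iters)
```

On a toy graph such as `{"A": ["B", "C"], "B": ["C"], "C": []}`, PageRank favors `C`, which collects incoming links, while CheiRank favors `A`, which has the most outgoing links.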
Semantic Sort: A Supervised Approach to Personalized Semantic Relatedness
We propose and study a novel supervised approach to learning statistical
semantic relatedness models from subjectively annotated training examples. The
proposed semantic model consists of parameterized co-occurrence statistics
associated with textual units of a large background knowledge corpus. We
present an efficient algorithm for learning such semantic models from a
training sample of relatedness preferences. Our method is corpus independent
and can essentially rely on any sufficiently large (unstructured) collection of
coherent texts. Moreover, the approach facilitates the fitting of semantic
models for specific users or groups of users. We present the results of
an extensive range of experiments from small to large scale, indicating that the
proposed method is effective and competitive with the state-of-the-art.
Comment: 37 pages, 8 figures. A short version of this paper was already
published at ECML/PKDD 201