Taxonomy and clustering in collaborative systems: the case of the on-line encyclopedia Wikipedia
In this paper we investigate the nature and structure of the relation between
imposed classifications and real clustering in a particular case of a
scale-free network given by the on-line encyclopedia Wikipedia. We find a
statistical similarity in the distributions of community sizes both by using
the top-down approach of the categories division present in the archive and in
the bottom-up procedure of community detection given by an algorithm based on
the spectral properties of the graph. Despite the statistically similar
behaviour, the two methods provide rather different divisions of the articles,
thereby signalling that the presence of power laws is a general feature of
these systems and cannot be used as a benchmark to evaluate the suitability of
a clustering method.
Comment: 5 pages, 3 figures, epl2 style
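The bottom-up procedure mentioned above rests on the spectral properties of the graph. A minimal sketch of one standard spectral technique, assuming spectral bisection via the Fiedler vector (the paper's exact algorithm may differ):

```python
# Illustrative sketch (not necessarily the paper's exact algorithm):
# community detection by spectral bisection, splitting nodes by the sign of
# the Fiedler vector (eigenvector of the second-smallest Laplacian eigenvalue).
import numpy as np

def spectral_bisect(adj):
    """Split an undirected graph into two communities.

    adj: symmetric 0/1 adjacency matrix as a list of lists.
    Returns two sets of node indices.
    """
    A = np.array(adj, dtype=float)
    D = np.diag(A.sum(axis=1))
    L = D - A                       # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
    fiedler = vecs[:, 1]            # eigenvector of 2nd-smallest eigenvalue
    left = {i for i, v in enumerate(fiedler) if v < 0}
    right = set(range(len(adj))) - left
    return left, right

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
left, right = spectral_bisect(adj)
```

On this toy graph the sign split recovers the two triangles, which is the kind of bottom-up division the abstract compares against the top-down category taxonomy.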
Walking across Wikipedia: a scale-free network model of semantic memory retrieval.
Semantic knowledge has been investigated using both online and offline methods. One common online method is category recall, in which members of a semantic category like "animals" are retrieved in a given period of time. The order, timing, and number of retrievals are used as assays of semantic memory processes. One common offline method is corpus analysis, in which the structure of semantic knowledge is extracted from texts using co-occurrence or encyclopedic methods. Online measures of semantic processing, as well as offline measures of semantic structure, have yielded data resembling inverse power law distributions. The aim of the present study is to investigate whether these patterns in data might be related. A semantic network model of animal knowledge is formulated on the basis of Wikipedia pages and their overlap in word probability distributions. The network is scale-free, in that node degree is related to node frequency as an inverse power law. A random walk over this network is shown to simulate a number of results from a category recall experiment, including power law-like distributions of inter-response intervals. Results are discussed in terms of theories of semantic structure and processing.
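The retrieval mechanism above can be sketched as a random walk that "recalls" a concept the first time its node is visited; inter-response intervals then lengthen as unvisited nodes become rarer. The toy network below is invented for illustration and is not the paper's Wikipedia-derived graph:

```python
# Minimal sketch of random-walk category recall (assumed details): the walk
# emits a word at each first visit; (word, step) pairs play the role of
# retrievals and their timing in the category recall task.
import random

def category_recall_walk(graph, start, steps, seed=0):
    """Walk a graph; return (word, step) pairs for first visits."""
    rng = random.Random(seed)
    node, visited = start, {start}
    recalls = [(start, 0)]
    for t in range(1, steps + 1):
        node = rng.choice(graph[node])   # follow a uniformly random edge
        if node not in visited:
            visited.add(node)
            recalls.append((node, t))    # a new concept is "retrieved"
    return recalls

# Hypothetical toy network of animal concepts (not the paper's data).
graph = {
    "animal": ["dog", "cat", "shark", "whale"],
    "dog": ["animal", "cat"],
    "cat": ["animal", "dog"],
    "shark": ["animal", "whale"],
    "whale": ["animal", "shark"],
}
recalls = category_recall_walk(graph, "animal", steps=50)
```

Gaps between consecutive recall times are the model's analogue of inter-response intervals; on a scale-free network their distribution is the quantity the abstract compares to human data.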
The Evolution of Wikipedia's Norm Network
Social norms have traditionally been difficult to quantify. In any particular
society, their sheer number and complex interdependencies often limit a
system-level analysis. One exception is that of the network of norms that
sustain the online Wikipedia community. We study the fifteen-year evolution of
this network using the interconnected set of pages that establish, describe,
and interpret the community's norms. Despite Wikipedia's reputation for
ad hoc governance, we find that its normative evolution is highly
conservative. The earliest users create norms that both dominate the network
and persist over time. These core norms govern both content and interpersonal
interactions using abstract principles such as neutrality, verifiability, and
"assume good faith". As the network grows, norm neighborhoods decouple
topologically from each other, while increasing in semantic coherence. Taken
together, these results suggest that the evolution of Wikipedia's norm network
is akin to bureaucratic systems that predate the information age.
Comment: 22 pages, 9 figures. Matches published version. Data available at
http://bit.ly/wiki_nor
Topic Similarity Networks: Visual Analytics for Large Document Sets
We investigate ways in which to improve the interpretability of LDA topic
models by better analyzing and visualizing their outputs. We focus on examining
what we refer to as topic similarity networks: graphs in which nodes represent
latent topics in text collections and links represent similarity among topics.
We describe efficient and effective approaches to both building and labeling
such networks. Visualizations of topic models based on these networks are shown
to be a powerful means of exploring, characterizing, and summarizing large
collections of unstructured text documents. They help to "tease out"
non-obvious connections among different sets of documents and provide insights
into how topics form larger themes. We demonstrate the efficacy and
practicality of these approaches through two case studies: 1) NSF grants for
basic research spanning a 14 year period and 2) the entire English portion of
Wikipedia.
Comment: 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014)
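A topic similarity network as described above can be sketched directly: nodes are topics (each a word probability distribution), and an edge links two topics whose distributions are similar enough. The topics, vocabulary, and cosine-plus-threshold choice below are illustrative assumptions, not the paper's exact construction:

```python
# Sketch: build a topic similarity network by linking topic pairs whose
# word distributions have cosine similarity above a threshold.
import math

def cosine(p, q, vocab):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in vocab)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def similarity_network(topics, threshold=0.4):
    """Return edges (topic_a, topic_b) for sufficiently similar topic pairs."""
    vocab = {w for dist in topics.values() for w in dist}
    names = sorted(topics)
    return [(a, b)
            for i, a in enumerate(names) for b in names[i + 1:]
            if cosine(topics[a], topics[b], vocab) > threshold]

# Hypothetical hand-made topic-word distributions (not real LDA output).
topics = {
    "genomics":  {"gene": 0.5, "dna": 0.3, "cell": 0.2},
    "proteins":  {"protein": 0.4, "cell": 0.3, "gene": 0.3},
    "astronomy": {"star": 0.7, "galaxy": 0.3},
}
edges = similarity_network(topics, threshold=0.4)
```

Visualizing such a graph is what lets related topics cluster into the larger themes the abstract mentions; labeling nodes with their top words is one common follow-up step.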
Evolution of Wikipedia's Category Structure
Wikipedia, as a social phenomenon of collaborative knowledge creation, has
been studied extensively from various points of view. The category system of
Wikipedia, introduced in 2004, has attracted relatively little attention. In
this study, we focus on the documentation of knowledge, and the transformation
of this documentation with time. We take Wikipedia as a proxy for knowledge in
general and its category system as an aspect of the structure of this
knowledge. We investigate the evolution of the category structure of the
English Wikipedia from its birth in 2004 to 2008. We treat the category system
as if it were a hierarchical Knowledge Organization System, capturing the changes
in the distributions of the top categories. We investigate how the clustering
of articles, defined by the category system, matches the direct link network
between the articles and show how it changes over time. We find the Wikipedia
category network mostly stable, but with occasional reorganization. We show
that the clustering matches the link structure quite well, except during short
periods preceding the reorganizations.
Comment: Preprint of an article submitted for consideration in Advances in
Complex Systems (2012) http://www.worldscinet.com/acs/, 19 pages, 7 figures
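One simple way to quantify how well a category-defined clustering matches the direct link network, sketched here as an assumed measure (the paper's actual analysis may use a different statistic, e.g. modularity), is the fraction of article-to-article links whose endpoints share a category:

```python
# Sketch: measure agreement between a category partition and a link network
# as the share of links that stay inside a category.
def intra_category_fraction(links, category):
    """links: (source, target) pairs; category: article -> category label."""
    inside = sum(1 for a, b in links if category[a] == category[b])
    return inside / len(links)

# Hypothetical articles, categories, and links, for illustration only.
category = {"Paris": "geo", "France": "geo",
            "Newton": "sci", "Calculus": "sci"}
links = [("Paris", "France"), ("Newton", "Calculus"),
         ("Newton", "France"), ("Calculus", "Newton")]
frac = intra_category_fraction(links, category)  # 3 of 4 links are intra-category
```

Tracking such a score over successive dumps would show the drops preceding reorganizations that the abstract reports.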
Analysis and Forecasting of Trending Topics in Online Media Streams
Among the vast information available on the web, social media streams capture
what people currently pay attention to and how they feel about certain topics.
Awareness of such trending topics plays a crucial role in multimedia systems
such as trend aware recommendation and automatic vocabulary selection for video
concept detection systems.
Correctly utilizing trending topics requires a better understanding of their
various characteristics in different social media streams. To this end, we
present the first comprehensive study across three major online and social
media streams, Twitter, Google, and Wikipedia, covering thousands of trending
topics during an observation period of an entire year. Our results indicate
that, depending on one's requirements, one does not necessarily have to turn to
Twitter for information about current events, and that some media streams
strongly emphasize content of specific categories. As our second key
contribution, we further present a novel approach for the challenging task of
forecasting the life cycle of trending topics in the very moment they emerge.
Our fully automated approach is based on a nearest neighbor forecasting
technique exploiting our assumption that semantically similar topics exhibit
similar behavior.
We demonstrate on a large-scale dataset of Wikipedia page view statistics
that forecasts by the proposed approach are about 9-48k views closer to the
actual viewing statistics compared to baseline methods and achieve a mean
average percentage error of 45-19% for time periods of up to 14 days.
Comment: ACM Multimedia 201
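The nearest-neighbor forecasting idea above can be sketched as follows: to predict an emerging topic's future page views, find the historical topic whose opening days look most similar and reuse its subsequent trajectory. The series below are invented, and matching on the raw view counts of the first days is an assumed simplification of the paper's semantic-similarity matching:

```python
# Sketch: nearest-neighbor forecast of a trending topic's life cycle.
def nn_forecast(history, prefix, horizon):
    """history: {topic: full daily view series}.
    prefix: observed first days of the new topic's series.
    Returns the nearest neighbor's next `horizon` values as the forecast."""
    k = len(prefix)
    def dist(series):
        # squared-error distance over the observed prefix
        return sum((a - b) ** 2 for a, b in zip(series[:k], prefix))
    best = min(history.values(), key=dist)
    return best[k:k + horizon]

# Hypothetical historical topics with known full trajectories.
history = {
    "flash_in_pan": [100, 50, 20, 10, 5, 2],
    "slow_burn":    [100, 110, 120, 130, 140, 150],
}
# A new topic that starts out decaying like "flash_in_pan".
forecast = nn_forecast(history, prefix=[95, 55, 25], horizon=3)
```

The new topic's decaying prefix is closest to the short-lived neighbor, so its tail is returned as the forecast, matching the abstract's assumption that semantically similar topics exhibit similar behavior.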
Knowledge-rich Image Gist Understanding Beyond Literal Meaning
We investigate the problem of understanding the message (gist) conveyed by
images and their captions as found, for instance, on websites or news articles.
To this end, we propose a methodology to capture the meaning of image-caption
pairs on the basis of large amounts of machine-readable knowledge that has
previously been shown to be highly effective for text understanding. Our method
identifies the connotation of objects beyond their denotation: where most
approaches to image understanding focus on the denotation of objects, i.e.,
their literal meaning, our work addresses the identification of connotations,
i.e., iconic meanings of objects, to understand the message of images. We view
image understanding as the task of representing an image-caption pair on the
basis of a wide-coverage vocabulary of concepts such as the one provided by
Wikipedia, and cast gist detection as a concept-ranking problem with
image-caption pairs as queries. To enable a thorough investigation of the
problem of gist understanding, we produce a gold standard of over 300
image-caption pairs and over 8,000 gist annotations covering a wide variety of
topics at different levels of abstraction. We use this dataset to
experimentally benchmark the contribution of signals from heterogeneous
sources, namely image and text. The best result with a Mean Average Precision
(MAP) of 0.69 indicates that by combining both dimensions we are able to better
understand the meaning of our image-caption pairs than when using language or
vision information alone. We test the robustness of our gist detection approach
when receiving automatically generated input, i.e., using automatically
generated image tags or generated captions, and prove the feasibility of an
end-to-end automated process.
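Since gist detection is cast as a concept-ranking problem scored by Mean Average Precision (MAP), the metric itself can be sketched; the rankings and gold gist labels below are invented for illustration:

```python
# Sketch: Mean Average Precision (MAP) for concept ranking, the evaluation
# measure named in the abstract.
def average_precision(ranked, relevant):
    """ranked: concepts ordered by the system; relevant: gold gist concepts."""
    hits, score = 0, 0.0
    for i, concept in enumerate(ranked, start=1):
        if concept in relevant:
            hits += 1
            score += hits / i          # precision at each relevant rank
    return score / len(relevant)

def mean_average_precision(queries):
    """queries: (ranked_concepts, relevant_set) pairs, one per image-caption query."""
    return sum(average_precision(r, g) for r, g in queries) / len(queries)

# Hypothetical image-caption queries with invented gold annotations.
queries = [
    (["peace", "dove", "bird"], {"peace", "dove"}),   # gist hits at ranks 1, 2
    (["sport", "victory"], {"victory"}),              # gist hit at rank 2
]
map_score = mean_average_precision(queries)
```

Averaging per-query average precision in this way is what yields a single summary figure like the 0.69 MAP the abstract reports for the combined image-and-text system.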