10,226 research outputs found
A Wikipedia Literature Review
This paper was originally designed as a literature review for a doctoral
dissertation focusing on Wikipedia. This exposition gives the structure of
Wikipedia and the latest trends in Wikipedia research
Collaboratively Patching Linked Data
Today's Web of Data is noisy. Linked Data often needs extensive preprocessing
to enable efficient use of heterogeneous resources. While consistent and valid
data provides the key to efficient data processing and aggregation we are
facing two main challenges: (1st) Identification of erroneous facts and
tracking their origins in dynamically connected datasets is a difficult task,
and (2nd) efforts in the curation of deficient facts in Linked Data are
exchanged rather rarely. Since erroneous data often is duplicated and
(re-)distributed by mashup applications it is not only the responsibility of a
few original publishers to keep their data tidy, but progresses to be a mission
for all distributers and consumers of Linked Data too. We present a new
approach to expose and to reuse patches on erroneous data to enhance and to add
quality information to the Web of Data. The feasibility of our approach is
demonstrated by example of a collaborative game that patches statements in
DBpedia data and provides notifications for relevant changes.Comment: 2nd International Workshop on Usage Analysis and the Web of Data
(USEWOD2012) in the 21st International World Wide Web Conference (WWW2012),
Lyon, France, April 17th, 201
How Many Topics? Stability Analysis for Topic Models
Topic modeling refers to the task of discovering the underlying thematic
structure in a text corpus, where the output is commonly presented as a report
of the top terms appearing in each topic. Despite the diversity of topic
modeling algorithms that have been proposed, a common challenge in successfully
applying these techniques is the selection of an appropriate number of topics
for a given corpus. Choosing too few topics will produce results that are
overly broad, while choosing too many will result in the "over-clustering" of a
corpus into many small, highly-similar topics. In this paper, we propose a
term-centric stability analysis strategy to address this issue, the idea being
that a model with an appropriate number of topics will be more robust to
perturbations in the data. Using a topic modeling approach based on matrix
factorization, evaluations performed on a range of corpora show that this
strategy can successfully guide the model selection process.Comment: Improve readability of plots. Add minor clarification
On Tree-Based Neural Sentence Modeling
Neural networks with tree-based sentence encoders have shown better results
on many downstream tasks. Most of existing tree-based encoders adopt syntactic
parsing trees as the explicit structure prior. To study the effectiveness of
different tree structures, we replace the parsing trees with trivial trees
(i.e., binary balanced tree, left-branching tree and right-branching tree) in
the encoders. Though trivial trees contain no syntactic information, those
encoders get competitive or even better results on all of the ten downstream
tasks we investigated. This surprising result indicates that explicit syntax
guidance may not be the main contributor to the superior performances of
tree-based neural sentence modeling. Further analysis show that tree modeling
gives better results when crucial words are closer to the final representation.
Additional experiments give more clues on how to design an effective tree-based
encoder. Our code is open-source and available at
https://github.com/ExplorerFreda/TreeEnc.Comment: To Appear at EMNLP 201
Semantic Stability in Social Tagging Streams
One potential disadvantage of social tagging systems is that due to the lack
of a centralized vocabulary, a crowd of users may never manage to reach a
consensus on the description of resources (e.g., books, users or songs) on the
Web. Yet, previous research has provided interesting evidence that the tag
distributions of resources may become semantically stable over time as more and
more users tag them. At the same time, previous work has raised an array of new
questions such as: (i) How can we assess the semantic stability of social
tagging systems in a robust and methodical way? (ii) Does semantic
stabilization of tags vary across different social tagging systems and
ultimately, (iii) what are the factors that can explain semantic stabilization
in such systems? In this work we tackle these questions by (i) presenting a
novel and robust method which overcomes a number of limitations in existing
methods, (ii) empirically investigating semantic stabilization processes in a
wide range of social tagging systems with distinct domains and properties and
(iii) detecting potential causes for semantic stabilization, specifically
imitation behavior, shared background knowledge and intrinsic properties of
natural language. Our results show that tagging streams which are generated by
a combination of imitation dynamics and shared background knowledge exhibit
faster and higher semantic stability than tagging streams which are generated
via imitation dynamics or natural language streams alone
- …