WikiLinkGraphs: A Complete, Longitudinal and Multi-Language Dataset of the Wikipedia Link Networks
Wikipedia articles contain multiple links connecting a subject to other pages
of the encyclopedia. In Wikipedia parlance, these links are called internal
links or wikilinks. We present a complete dataset of the network of internal
Wikipedia links for the largest language editions. The dataset contains
yearly snapshots of the network and spans 17 years, from the creation of
Wikipedia in 2001 to March 1st, 2018. While previous work has mostly focused on
the complete hyperlink graph which includes also links automatically generated
by templates, we parsed each revision of each article to track links appearing
in the main text. In this way we obtained a cleaner network, discarding more
than half of the links and representing all and only the links intentionally
added by editors. We describe in detail how the Wikipedia dumps have been
processed and the challenges we have encountered, including the need to handle
special pages such as redirects, i.e., alternative article titles. We present
descriptive statistics of several snapshots of this network. Finally, we
propose several research opportunities that can be explored using this new
dataset.
Comment: 10 pages, 3 figures, 7 tables, LaTeX. Final camera-ready version
accepted at the 13th International AAAI Conference on Web and Social Media
(ICWSM 2019) - Munich (Germany), 11-14 June 2019
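The abstract above describes tracking links that appear in an article's main text and handling redirects. A minimal sketch of that idea follows, assuming raw (unexpanded) wikitext as input, where links produced by template expansion are not present as `[[...]]` markup. The function name `extract_wikilinks`, the `redirects` mapping, and the single-hop redirect resolution are illustrative assumptions, not the authors' pipeline.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the target title (before any '#' or '|') is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_wikilinks(wikitext, redirects=None):
    """Return the link targets found in raw wikitext, in order of appearance.

    `redirects` is a hypothetical mapping from a redirect title to its
    destination title; it is followed for one hop only.
    """
    redirects = redirects or {}
    targets = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).replace("_", " ").strip()
        if not title:
            continue
        # Wikipedia titles are case-insensitive in their first character.
        title = title[0].upper() + title[1:]
        targets.append(redirects.get(title, title))
    return targets


text = "The [[USA|United States]] and [[rome]] share [[Paris#History]]."
links = extract_wikilinks(text, redirects={"USA": "United States"})
print(links)
```

A real pipeline would also need to deal with namespaces (for example file and category links), nested markup, and chains of redirects, all of which this sketch ignores.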
Eliciting New Wikipedia Users' Interests via Automatically Mined Questionnaires: For a Warm Welcome, Not a Cold Start
Every day, thousands of users sign up as new Wikipedia contributors. Once
joined, these users have to decide which articles to contribute to, which users
to seek out and learn from or collaborate with, etc. Any such task is a hard
and potentially frustrating one given the sheer size of Wikipedia. Supporting
newcomers in their first steps by recommending articles they would enjoy
editing or editors they would enjoy collaborating with is thus a promising
route toward converting them into long-term contributors. Standard recommender
systems, however, rely on users' histories of previous interactions with the
platform. As such, these systems cannot make high-quality recommendations to
newcomers without any previous interactions -- the so-called cold-start
problem. The present paper addresses the cold-start problem on Wikipedia by
developing a method for automatically building short questionnaires that, when
completed by a newly registered Wikipedia user, can be used for a variety of
purposes, including article recommendations that can help new editors get
started. Our questionnaires are constructed based on the text of Wikipedia
articles as well as the history of contributions by the already onboarded
Wikipedia editors. We assess the quality of our questionnaire-based
recommendations in an offline evaluation using historical data, as well as an
online evaluation with hundreds of real Wikipedia newcomers, concluding that
our method provides cohesive, human-readable questions that perform well
against several baselines. By addressing the cold-start problem, this work can
help with the sustainable growth and maintenance of Wikipedia's diverse editor
community.
Comment: Accepted at the 13th International AAAI Conference on Web and Social
Media (ICWSM-2019)
Structuring Wikipedia Articles with Section Recommendations
Sections are the building blocks of Wikipedia articles. They enhance
readability and can be used as a structured entry point for creating and
expanding articles. Structuring a new or already existing Wikipedia article
with sections is a hard task for humans, especially for newcomers or less
experienced editors, as it requires significant knowledge about how a
well-written article looks for each possible topic. Inspired by this need, the
present paper defines the problem of section recommendation for Wikipedia
articles and proposes several approaches for tackling it. Our systems can help
editors by recommending what sections to add to already existing or newly
created Wikipedia articles. Our basic paradigm is to generate recommendations
by sourcing sections from articles that are similar to the input article. We
explore several ways of defining similarity for this purpose (based on topic
modeling, collaborative filtering, and Wikipedia's category system). We use
both automatic and human evaluation approaches for assessing the performance of
our recommendation system, concluding that the category-based approach works
best, achieving precision@10 of about 80% in the human evaluation.
Comment: SIGIR '18 camera-ready
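The basic paradigm described above (sourcing sections from similar articles, with similarity defined through shared categories) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the paper's system: the `corpus` structure, the function names, and the rule "similar means sharing at least one category" are all hypothetical. A small precision@k helper is included to show how the reported metric is computed.

```python
from collections import Counter


def recommend_sections(article_categories, corpus, k=10):
    """Rank section titles by how many category-sharing articles use them.

    `corpus` maps an article title to a (set of categories, list of section
    titles) pair. Ties are broken alphabetically for determinism.
    """
    counts = Counter()
    for categories, sections in corpus.values():
        if article_categories & categories:  # shares at least one category
            counts.update(set(sections))     # count each section once per article
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [section for section, _ in ranked[:k]]


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are judged relevant."""
    top = recommended[:k]
    return sum(1 for section in top if section in relevant) / k


corpus = {
    "A": ({"Cities in Italy"}, ["History", "Geography", "Economy"]),
    "B": ({"Cities in Italy"}, ["History", "Geography"]),
    "C": ({"Cities in Italy", "Capitals"}, ["History"]),
    "D": ({"Birds"}, ["Taxonomy", "Habitat"]),
}
top2 = recommend_sections({"Cities in Italy"}, corpus, k=2)
print(top2, precision_at_k(top2, {"History"}, 2))
```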
Evolution of Wikipedia’s medical content: past, present and future
As one of the most commonly read online sources of medical information, Wikipedia is an influential public health platform. Its medical content, community, collaborations and challenges have been evolving since its creation in 2001, and engagement by the medical community is vital for ensuring its accuracy and completeness. Both the encyclopaedia’s internal metrics and external assessments of its quality indicate that its articles are highly variable, but improving. Although content can be edited by anyone, medical articles are primarily written by a core group of medical professionals. Diverse collaborative ventures have enhanced medical article quality and reach, and opportunities for partnerships are more available than ever. Nevertheless, Wikipedia’s medical content and community still face significant challenges, and a socioecological model is used to structure specific recommendations. We propose that the medical community should prioritise the accuracy of biomedical information in the world’s most consulted encyclopaedia.
Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia
Wikipedia's contents are based on reliable and published sources. To this
date, relatively little is known about what sources Wikipedia relies on, in
part because extracting citations and identifying cited sources is challenging.
To close this gap, we release Wikipedia Citations, a comprehensive dataset of
citations extracted from Wikipedia. A total of 29.3M citations were extracted
from 6.1M English Wikipedia articles as of May 2020, and classified as being to
books, journal articles or Web contents. We were thus able to extract 4.0M
citations to scholarly publications with known identifiers -- including DOI,
PMC, PMID, and ISBN -- and further equip an extra 261K citations with DOIs from
Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least
one journal article with an associated DOI, and that Wikipedia cites just 2% of
all articles with a DOI currently indexed in the Web of Science. We release our
code to allow the community to extend upon our work and update the dataset in
the future.
Examining the Impact of Algorithm Awareness on Wikidata's Recommender System Recoin
The global infrastructure of the Web, designed as an open and transparent
system, has a significant impact on our society. However, algorithmic systems
of corporate entities that neglect those principles increasingly populate the
Web. Typical representatives of these algorithmic systems are recommender
systems that influence our society both on a scale of global politics and
during mundane shopping decisions. Recently, such recommender systems have come
under critique for how they may strengthen existing or even generate new kinds
of biases. To this end, designers and engineers are increasingly urged to make
the functioning and purpose of recommender systems more transparent. Our
research relates to the discourse of algorithm awareness, which reconsiders the
role of algorithm visibility in interface design. We conducted online
experiments with 105 participants using MTurk for the recommender system
Recoin, a gadget for Wikidata. In these experiments, we presented users with
one of a set of three different designs of Recoin's user interface, each of
them exhibiting a varying degree of explainability and interactivity. Our
findings include a positive correlation between comprehension of and trust in
an algorithmic system in our interactive redesign. However, our results are not
conclusive yet, and suggest that the measures of comprehension, fairness,
accuracy and trust are not yet exhaustive for the empirical study of algorithm
awareness. Our qualitative insights provide a first indication for further
measures. Our study participants, for example, were less concerned with the
details of understanding an algorithmic calculation than with who or what is
judging the result of the algorithm.
Comment: 10 pages, 7 figures