Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions
The moderation of content on online platforms is usually non-transparent. On
Wikipedia, however, this discussion is carried out publicly and the editors are
encouraged to use the content moderation policies as explanations for making
moderation decisions. Currently, only a few comments explicitly mention those
policies -- 20% of the English ones, but as few as 2% of the German and Turkish
comments. To aid in this process of understanding how content is moderated, we
construct a novel multilingual dataset of Wikipedia editor discussions along
with their reasoning in three languages. The dataset contains the stances of
the editors (keep, delete, merge, comment), along with the stated reason, and a
content moderation policy, for each edit decision. We demonstrate that stance
and corresponding reason (policy) can be predicted jointly with a high degree
of accuracy, adding transparency to the decision-making process. We release
both our joint prediction models and the multilingual content moderation
dataset for further research on automated transparent content moderation.
Comment: This submission has been accepted to the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023).
Thorny Roses: Investigating the Dual Use Dilemma in Natural Language Processing
Dual use, the intentional, harmful reuse of technology and scientific
artefacts, is a problem yet to be well-defined within the context of Natural
Language Processing (NLP). However, as NLP technologies continue to advance and
become increasingly widespread in society, their inner workings have become
increasingly opaque. Therefore, understanding dual use concerns and potential
ways of limiting them is critical to minimising the potential harms of research
and development. In this paper, we conduct a survey of NLP researchers and
practitioners to understand the depth of the problem and their perspectives on it, as
well as to assess the support currently available. Based on the results of our
survey, we offer a definition of dual use that is tailored to the needs of the
NLP community. The survey revealed that a majority of researchers are concerned
about the potential dual use of their research but only take limited action
toward it. In light of the survey results, we discuss the current state and
potential means for mitigating dual use in NLP and propose a checklist that can
be integrated into existing conference ethics frameworks, e.g., the ACL ethics
checklist.
Property Label Stability in Wikidata
Stability in Wikidata's schema is essential for the reuse of its data. In this paper, we analyze the stability of the data based on the changes in labels of properties in six languages. We find that the schema is overall stable, making it a reliable resource for external usage.
Non-parametric class completeness estimators for collaborative knowledge graphs — the case of Wikidata
Collaborative Knowledge Graph platforms allow humans and automated scripts to collaborate in creating, updating and interlinking entities and facts. To ensure both the completeness of the data as well as a uniform coverage of the different topics, it is crucial to identify underrepresented classes in the Knowledge Graph. In this paper, we tackle this problem by developing statistical techniques for class cardinality estimation in collaborative Knowledge Graph platforms. Our method is able to estimate the completeness of a class—as defined by a schema or ontology—hence can be used to answer questions such as “Does the knowledge base have a complete list of all {Beer Brands—Volcanos—Video Game Consoles}?” As a use-case, we focus on Wikidata, which poses unique challenges in terms of the size of its ontology, the number of users actively populating its graph, and its extremely dynamic nature. Our techniques are derived from species estimation and data-management methodologies, and are applied to the case of graphs and collaborative editing. In our empirical evaluation, we observe that i) the number and frequency of unique class instances drastically influence the performance of an estimator, ii) bursts of inserts cause some estimators to overestimate the true size of the class if they are not properly handled, and iii) one can effectively measure the convergence of a class towards its true size by considering the stability of an estimator against the number of available instances.
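To illustrate the species-estimation idea behind this line of work: a classic estimator of this family is Chao1, which extrapolates the total number of class instances from how often each observed instance is "sighted" (e.g., mentioned in edits). This is a minimal illustrative sketch, not the exact estimator from the paper; the function name and the treatment of edit histories as sightings are assumptions for the example.

```python
from collections import Counter

def chao1_estimate(sightings):
    """Chao1-style estimate of total class size from per-instance sighting counts.

    sightings: a list of instance identifiers, one entry per sighting
    (an instance appearing in many edits contributes many entries).
    Illustrative only -- the paper evaluates several such estimators.
    """
    counts = Counter(sightings)
    s_obs = len(counts)                              # distinct instances observed
    f1 = sum(1 for c in counts.values() if c == 1)   # singletons: seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)   # doubletons: seen exactly twice
    if f2 == 0:
        # bias-corrected variant avoids division by zero when there are no doubletons
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + (f1 * f1) / (2.0 * f2)

# Hypothetical edit log mentioning knowledge-graph items of one class:
edits = ["Q1", "Q1", "Q2", "Q3", "Q3", "Q4", "Q5"]
print(chao1_estimate(edits))  # -> 7.25 (5 observed + 3^2 / (2*2) unseen)
```

Intuitively, many singletons relative to doubletons suggest that many instances remain unseen, so the class is far from complete.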
Multilinguality in knowledge graphs
Content on the web is predominantly in English, which makes it inaccessible to individuals who exclusively speak other languages. Knowledge graphs can store multilingual information, facilitate the creation of multilingual applications, and make these accessible to more language communities. In this thesis, we present studies to assess and improve the state of labels and languages in knowledge graphs and apply multilingual information. We propose ways to use multilingual knowledge graphs to reduce gaps in coverage between languages. We explore the current state of language distribution in knowledge graphs by developing a framework - based on existing standards, frameworks, and guidelines - to measure label and language distribution in knowledge graphs. We apply this framework to a dataset representing the web of data, and to Wikidata. We find that there is a lack of labelling on the web of data, and a bias towards a small set of languages. Due to its multilingual editors, Wikidata has a better distribution of languages in labels. We explore how this knowledge about labels and languages can be used in the domain of question answering. We show that we can apply our framework to the task of ranking and selecting knowledge graphs for a set of user questions. A way of overcoming the lack of multilingual information in knowledge graphs is to transliterate and translate knowledge graph labels and aliases. We propose the automatic classification of labels into transliteration or translation in order to train a model for each task. Classification before generation improves results compared to using either a translation- or transliteration-based model on its own. A use case of multilingual labels is the generation of article placeholders for Wikipedia using neural text generation in lower-resourced languages.
On the basis of surveys and semi-structured interviews, we show that Wikipedia community members find the placeholder pages, and especially the generated summaries, helpful, and are highly likely to accept and reuse the generated text.
luciekaffee/NumTab: NumTab v0.1
The first released version of NumTab, including the code and a version of the finished dataset.
The human face of the web of data: a cross-sectional study of labels
Labels in the web of data are the key element for humans to access the data. We introduce a framework to measure the coverage of information with labels. The framework is based on a set of metrics including completeness, unambiguity, multilinguality, labeled object usage, and monolingual islands. We apply this framework to seven diverse datasets, from the web of data, a collaborative knowledge base, open governmental and GLAM data. We gain an insight into the current state of labels and multilinguality on the web of data. Comparing a set of differently sourced datasets can help data publishers to understand what they can improve and what other ways of collecting data can be adopted.
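The completeness metric mentioned above can be sketched as the fraction of entities carrying a label in a given language. This is a hypothetical, simplified rendering of the idea; the function name, the dict-based data shape, and the example entity IDs are assumptions for illustration, not the framework's actual interface.

```python
def label_completeness(entities, labels, language):
    """Fraction of entities that have a label in the given language.

    entities: iterable of entity IDs
    labels:   dict mapping entity ID -> {language_code: label}
    Illustrative sketch of one metric from the framework.
    """
    entities = list(entities)
    if not entities:
        return 0.0
    labelled = sum(1 for e in entities if language in labels.get(e, {}))
    return labelled / len(entities)

# Hypothetical mini-dataset of labelled entities:
labels = {
    "Q64":   {"en": "Berlin", "de": "Berlin"},
    "Q90":   {"en": "Paris"},
    "Q1490": {"ja": "東京"},
}
print(label_completeness(labels.keys(), labels, "en"))  # 2 of 3 entities labelled in English
```

Running the same computation per language across a dataset exposes exactly the kind of language bias the paper reports.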
Using natural language generation to bootstrap missing Wikipedia articles: A human-centric perspective
Nowadays, natural language generation (NLG) is used in everything from news reporting and chatbots to social media management. Recent advances in machine learning have made it possible to train NLG systems that seek to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with Wikipedia readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in under-resourced Wikipedia language versions, which displays structured data from the Wikidata knowledge base on empty Wikipedia pages. We train a neural network to generate an introductory sentence from the Wikidata triples shown by the ArticlePlaceholder, and explore how Wikipedia users engage with it. The evaluation, which includes an automatic, a judgement-based, and a task-based component, shows that the summary sentences score well in terms of perceived fluency and appropriateness for Wikipedia, and can help editors bootstrap new articles. It also hints at several potential implications of using NLG solutions in Wikipedia at large, including content quality, trust in technology, and algorithmic transparency.
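To make the triples-to-sentence task concrete: the system's input is a set of (property, value) pairs for an entity, and its output is one introductory sentence. The paper uses a trained neural network for this; the template-based sketch below is a hypothetical stand-in that only illustrates the input/output shape, and all property names and the entity are invented for the example.

```python
def verbalise(entity_label, triples):
    """Turn (property, value) pairs into a single introductory sentence.

    Illustrative templates only -- the actual ArticlePlaceholder work
    learns this mapping with a neural model; this shows the task shape.
    """
    templates = {
        "instance of": "{e} is a {v}",
        "country": "in {v}",
        "inception": "founded in {v}",
    }
    parts = []
    for prop, value in triples:
        if prop in templates:
            parts.append(templates[prop].format(e=entity_label, v=value))
    return ", ".join(parts) + "." if parts else ""

# Hypothetical Wikidata-style triples for an entity:
triples = [("instance of", "city"), ("country", "Kenya"), ("inception", "1899")]
print(verbalise("Nairobi", triples))  # -> Nairobi is a city, in Kenya, founded in 1899.
```

A neural generator replaces the hand-written templates with fluency learned from existing Wikipedia introductions, which is what the human evaluation in the paper assesses.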