Building automated vandalism detection tools for Wikidata
Wikidata, like Wikipedia, is a knowledge base that anyone can edit. This open
collaboration model is powerful in that it reduces barriers to participation
and allows a large number of people to contribute. However, it exposes the
knowledge base to the risk of vandalism and low-quality contributions. In this
work, we build on past work detecting vandalism in Wikipedia to detect
vandalism in Wikidata. This work is novel in that identifying damaging changes
in a structured knowledge base requires substantially different feature
engineering work than in a text-based wiki like Wikipedia. We also discuss the
utility of these classifiers for reducing the overall workload of vandalism
patrollers in Wikidata. We describe a machine classification strategy that is
able to catch 89% of vandalism while reducing patrollers' workload by 98%, by
drawing lightly from contextual features of an edit and heavily from the
characteristics of the user making the edit.
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201
Automatic detection of accommodation steps as an indicator of knowledge maturing
Jointly working on shared digital artifacts, such as wikis, is a well-tried method of developing knowledge collectively within a group or organization. Our assumption is that such knowledge maturing is an accommodation process that can be measured by taking the writing process itself into account. This paper describes the development of a tool that detects accommodation automatically with the help of machine learning algorithms. We applied a software framework for task detection to the automatic identification of accommodation processes within a wiki. To set up the learning algorithms and test their performance, we conducted an empirical study in which participants had to contribute to a wiki and, at the same time, identify their own tasks. Two domain experts evaluated the participants’ micro-tasks with regard to accommodation. We then applied an ontology-based task detection approach that identified accommodation with a rate of 79.12%. The potential use of our tool for measuring knowledge maturing online is discussed.
TiFi: Taxonomy Induction for Fictional Domains [Extended version]
Taxonomies are important building blocks of structured knowledge bases, and their construction from text sources and Wikipedia has received much attention. In this paper we focus on the construction of taxonomies for fictional domains, using noisy category systems from fan wikis or text extraction as input. Such fictional domains are archetypes of entity universes that are poorly covered by Wikipedia, as are enterprise-specific knowledge bases and highly specialized verticals. Our fiction-targeted approach, called TiFi, consists of three phases: (i) category cleaning, by identifying candidate categories that truly represent classes in the domain of interest, (ii) edge cleaning, by selecting subcategory relationships that correspond to class subsumption, and (iii) top-level construction, by mapping classes onto a subset of high-level WordNet categories. A comprehensive evaluation shows that TiFi is able to construct taxonomies for a diverse range of fictional domains, such as Lord of the Rings, The Simpsons, or Greek Mythology, with very high precision, and that it outperforms state-of-the-art baselines for taxonomy induction by a substantial margin.