134 research outputs found
How Contentious Terms About People and Cultures are Used in Linked Open Data
Web resources in linked open data (LOD) are comprehensible to humans through
literal textual values attached to them, such as labels, notes, or comments.
Word choices in literals may not always be neutral. When outdated and
culturally stereotyping terminology is used in literals, it may appear
offensive to users in interfaces and propagate stereotypes to algorithms
trained on them. We study how frequently and in which literals contentious
terms about people and cultures occur in LOD and whether there are attempts to
mark the usage of such terms. For our analysis, we reuse English and Dutch
terms from a knowledge graph that provides opinions of experts from the
cultural heritage domain about terms' contentiousness. We inspect occurrences
of these terms in four widely used datasets: Wikidata, The Getty Art &
Architecture Thesaurus, Princeton WordNet, and Open Dutch WordNet. Some terms
are ambiguous and contentious only in particular senses. Applying word sense
disambiguation, we generate a set of literals relevant to our analysis. We
found that outdated, derogatory, stereotyping terms frequently appear in
descriptive and labelling literals, such as preferred labels that are usually
displayed in interfaces and used for indexing. In some cases, LOD contributors
mark contentious terms with words and phrases in literals (implicit markers) or
properties linked to resources (explicit markers). However, such marking is
rare and inconsistent across all datasets. Our quantitative and qualitative
insights could be helpful in developing more systematic approaches to address
the propagation of stereotypes via LOD.
Exploring concept representations for concept drift detection
We present an approach to estimating concept drift in online news. Our method is to construct temporal concept vectors from topic-annotated news articles, and to correlate the distance between the temporal concept vectors with edits to the Wikipedia entries of the concepts. We find improvements in the correlation when we split the news articles based on the number of articles mentioning a concept, instead of calendar-based units of time.
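As a rough illustration of the splitting idea in this abstract, the sketch below groups chronologically ordered article vectors into bins that each hold the same number of articles mentioning a concept (rather than equal calendar spans), builds one vector per bin, and measures the distance between consecutive bins. The function names, the toy vectors, mean pooling, and cosine distance are all assumptions made for illustration; the abstract does not specify how the temporal concept vectors or their distance are computed.

```python
import math

def split_by_count(items, bin_size):
    """Split a chronologically ordered list into bins of bin_size items (the last bin may be shorter)."""
    return [items[i:i + bin_size] for i in range(0, len(items), bin_size)]

def mean_vector(vectors):
    """Component-wise mean of equally long vectors (assumed pooling choice)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def drift_series(article_vectors, bin_size):
    """Distances between consecutive per-bin concept vectors."""
    bins = split_by_count(article_vectors, bin_size)
    concept_vectors = [mean_vector(b) for b in bins]
    return [cosine_distance(a, b) for a, b in zip(concept_vectors, concept_vectors[1:])]

# Toy data (hypothetical): four articles mentioning one concept, in time order.
articles = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
drift = drift_series(articles, bin_size=2)
```

A series like `drift` could then be correlated with counts of Wikipedia edits per bin, which is the comparison the abstract describes.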
Bias in the analysis of multilingual legislative speech
In this paper we investigate the application of natural language processing tools to the multilingual proceedings of the European Parliament. This work is part of a study in which we explore (1) how subcorpora in different languages may lead to different conclusions about the political landscape, (2) how to determine where a potential language-related bias originates, and (3) to what extent we can limit or even prevent an unwanted language bias.
A corpus of images and text in online news
In recent years, several datasets have been released that include images and text, giving impetus to new methods that combine natural language processing and computer vision. However, there is a need for datasets of images in their natural textual context. The ION corpus contains 300K news articles published between August 2014 and 2015 in five online newspapers from two countries. The 1-year coverage over multiple publishers ensures a broad scope in terms of topics, image quality and editorial viewpoints. The corpus consists of JSON-LD files with the following data about each article: the original URL of the article on the news publisher's website, the date of publication, the headline of the article, the URL of the image displayed with the article (if any), and the caption of that image. Neither the article text nor the images themselves are included in the corpus. Instead, the images are distributed as high-dimensional feature vectors extracted from a Convolutional Neural Network, anticipating their use in computer vision tasks. The article text is represented as a list of automatically generated entity and topic annotations in the form of Wikipedia/DBpedia pages. This facilitates the selection of subsets of the corpus for separate analysis or evaluation.
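The subset selection mentioned at the end of this abstract might look roughly like the sketch below, which filters article records by a topic annotation. The key names (`url`, `datePublished`, `headline`, `image`, `caption`, `topics`), the sample records, and the `select_by_topic` helper are hypothetical; the abstract lists which fields exist but not the actual JSON-LD vocabulary used in the corpus.

```python
import json

# Hypothetical records mirroring the fields listed in the abstract;
# the real corpus files may use different keys and a JSON-LD context.
SAMPLE = """
[
  {"url": "https://example.org/news/1", "datePublished": "2014-09-01",
   "headline": "Example headline A", "image": "https://example.org/img/1.jpg",
   "caption": "Example caption",
   "topics": ["http://dbpedia.org/resource/Football"]},
  {"url": "https://example.org/news/2", "datePublished": "2015-01-15",
   "headline": "Example headline B", "image": null, "caption": null,
   "topics": ["http://dbpedia.org/resource/Election"]}
]
"""

def select_by_topic(records, topic_uri):
    """Return the records whose topic annotations include topic_uri."""
    return [r for r in records if topic_uri in r.get("topics", [])]

records = json.loads(SAMPLE)
subset = select_by_topic(records, "http://dbpedia.org/resource/Football")
with_image = [r for r in records if r.get("image")]  # articles that display an image
```

Because the annotations are page identifiers rather than free text, a subset can be chosen with an exact match like this, without any text processing.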
SWISH DataLab: A Web Interface for Data Exploration and Analysis
SWISH DataLab is a single integrated collaborative environment for data processing, exploration and analysis combining Prolog and R. The web interface makes it possible to share the data, the code of all processing steps and the results among researchers, and a versioning system facilitates reproducibility of the research at any chosen point. Using search logs from the National Library of the Netherlands combined with the collection content metadata, we demonstrate how to use SWISH DataLab for all stages of data analysis, using Prolog predicates, graph visualizations, and R.
Interchanging lexical resources on the Semantic Web
Lexica and terminology databases play a vital role in many NLP applications, but currently most such resources are published in application-specific formats, or with custom access interfaces, leading to the problem that much of this data is in "data silos" and hence difficult to access. The Semantic Web and in particular the Linked Data initiative provide effective solutions to this problem, as well as possibilities for data reuse by inter-lexicon linking, and incorporation of data categories by dereferenceable URIs. The Semantic Web focuses on the use of ontologies to describe semantics on the Web, but currently there is no standard for providing complex lexical information for such ontologies and for describing the relationship between the lexicon and the ontology. We present our model, lemon, which aims to address these gaps.