Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation
Annotation projects dealing with complex semantic or pragmatic phenomena face a dilemma: create annotation schemes that oversimplify the phenomena, or capture distinctions that conventional reliability metrics cannot measure adequately. The solution is to develop metrics that quantify the decisions annotators are asked to make. This paper discusses MASI, a distance metric for comparing sets, and illustrates its use in quantifying the reliability of a specific dataset. Annotations of Summary Content Units (SCUs) generate models referred to as pyramids, which can be used to evaluate unseen human summaries or machine summaries. The paper presents reliability results for five pairs of pyramids created for document sets from the 2003 Document Understanding Conference (DUC). The annotators worked independently of each other. Differences between the application of MASI to pyramid annotation and its previous application to co-reference annotation are discussed. In addition, it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena. In effect, what counts as sufficiently reliable inter-annotator agreement depends on the use to which the annotated data will be put.
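The MASI metric described above can be sketched in a few lines: it weights Jaccard set similarity by a monotonicity factor that distinguishes identical sets, subset relations, partial overlap, and disjoint sets. This is a minimal illustration of the metric's definition, not the paper's evaluation code.

```python
def masi(a: set, b: set) -> float:
    """MASI similarity between two label sets (Passonneau, 2006).

    Jaccard similarity weighted by a monotonicity factor:
    1 for identical sets, 2/3 when one set contains the other,
    1/3 for partial overlap, 0 for disjoint sets.
    """
    if not a and not b:
        return 1.0  # two empty annotations agree perfectly
    jaccard = len(a & b) / len(a | b)
    if a == b:
        m = 1.0
    elif a <= b or b <= a:
        m = 2 / 3
    elif a & b:
        m = 1 / 3
    else:
        m = 0.0
    return jaccard * m
```

For example, `masi({1, 2}, {1, 2, 3})` combines a Jaccard score of 2/3 with the subset weight 2/3, yielding 4/9; disjoint sets score 0 regardless of size.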
A repository of data and evaluation resources for natural language generation
Starting in 2007, the field of natural language generation (NLG) has organised shared-task evaluation events every year, under the
Generation Challenges umbrella. In the course of these shared tasks, a wealth of data has been created, along with associated task
definitions and evaluation regimes. In other contexts too, sharable NLG data is now being created. In this paper, we describe the online
repository that we have created as a one-stop resource for obtaining NLG task materials, both from Generation Challenges tasks and
from other sources, where the set of materials provided for each task consists of (i) task definition, (ii) input and output data, (iii)
evaluation software, (iv) documentation, and (v) publications reporting previous results.
Annotating Social Media Data From Vulnerable Populations: Evaluating Disagreement Between Domain Experts and Graduate Student Annotators
Researchers in computer science have spent considerable time developing methods to increase the accuracy and richness of annotations. However, there is a dearth of research that examines the positionality of annotators, how they are trained, and what we can learn from disagreements between different groups of annotators. In this study, we use qualitative analysis and statistical and computational methods to compare annotations between Chicago-based domain experts and graduate students who annotated a total of 1,851 tweets with images. These tweets are part of a larger corpus associated with the Chicago Gang Intervention Study, which aims to develop a computational system that detects aggression and loss among gang-involved youth in Chicago. We found evidence to support the study of disagreement between annotators and underscore the need for domain expertise when reviewing Twitter data from vulnerable populations. Implications for annotation and content moderation are discussed.
Inter-Annotator Agreement in Linguistics: A Critical Review
Inter-annotator agreement coefficients are widely used in Computational Linguistics and NLP to assess the reliability of linguistic annotation tasks. The paper offers a brief review of the scientific literature on the topic, illustrating chance-corrected coefficients and their interpretation.
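The chance-corrected coefficients surveyed above share a common shape: observed agreement is discounted by the agreement expected under each annotator's marginal label distribution. A minimal sketch of the best-known such coefficient, Cohen's kappa, for two annotators:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # observed agreement: fraction of items both annotators labeled identically
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # expected chance agreement from each annotator's marginal distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators use a single label
    return (p_o - p_e) / (1 - p_e)
```

With 75% raw agreement and 50% expected chance agreement, kappa comes out at 0.5, illustrating why chance correction matters: raw percent agreement alone overstates reliability.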
Assessing Fine-Grained Explicitness of Song Lyrics
Music plays a crucial role in our lives, with growing consumption and engagement through streaming services and social media platforms. However, caution is needed for children, who may be exposed to explicit content through songs. Initiatives such as the Parental Advisory Label (PAL) and similar labelling from streaming content providers aim to protect children from harmful content. So far, however, labelling has been limited to tagging a song as explicit, without providing any additional information on the reasons for the explicitness (e.g., strong language, sexual reference). This paper addresses this issue by developing a system capable of detecting explicit song lyrics and assessing the kind of explicit content detected. The novel contributions of the work include (i) a new dataset of 4,000 song lyrics annotated with five possible reasons for content explicitness and (ii) experiments with machine learning classifiers to predict explicitness and the reasons for it. The results demonstrate the feasibility of automatically detecting explicit content and the reasons for explicitness in song lyrics. This work is the first to address explicitness at this level of detail and provides a valuable contribution to the music industry, helping to protect children from exposure to inappropriate content.
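The task described above is a multi-label one: a song can be explicit for several reasons at once. As a toy illustration of that framing only, the sketch below tags lyrics against a keyword lexicon; the category names and keywords are hypothetical placeholders, not the paper's dataset categories or its machine-learning method.

```python
# Hypothetical lexicon: reason label -> trigger words (illustrative only).
LEXICON = {
    "strong language": {"damn", "hell"},
    "violence": {"gun", "kill"},
}

def explicitness_reasons(lyrics: str) -> set:
    """Return the set of explicitness reasons triggered by the lyrics."""
    tokens = set(lyrics.lower().split())
    return {reason for reason, words in LEXICON.items() if tokens & words}
```

A real system would replace the lexicon lookup with trained per-label classifiers, but the output shape (a possibly empty set of reason labels per song) is the same.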
Selecting and Categorizing Textual Descriptions of Images in the Context of an Image Indexer's Toolkit
We describe a series of studies aimed at identifying specifications for a text extraction module of an image indexer's toolkit. The materials used in the studies consist of images paired with paragraph sequences that describe the images. We administered a pilot survey to visual resource center professionals at three universities to determine what types of paragraphs would be preferred for metadata selection. Respondents generally showed a strong preference for one of the two paragraphs they were presented with, indicating that not all paragraphs that describe images are seen as good sources of metadata. We developed a set of semantic category labels to assign to spans of text in order to distinguish between different types of information about the images, and thus to classify metadata contexts. Human agreement on metadata is notoriously variable. In order to maximize agreement, we conducted four human labeling experiments using the seven semantic category labels we developed. A subset of our labelers had much higher inter-annotator reliability, and the highest reliability occurred when labelers could pick two labels per text unit.
A Taxonomy for Mining and Classifying Privacy Requirements in Issue Reports
Digital and physical footprints are trails of user activity collected through the use of software applications and systems. As software becomes ubiquitous, protecting user privacy has become challenging. With increasing user privacy awareness and the advent of privacy regulations and policies, there is an emerging need to implement software systems that enhance the protection of personal data processing. However, existing privacy regulations and policies only provide high-level principles, which makes it difficult for software engineers to design and implement privacy-aware systems. In this paper, we develop a taxonomy that provides a comprehensive set of privacy requirements based on two well-established and widely adopted privacy regulations and frameworks, the General Data Protection Regulation (GDPR) and ISO/IEC 29100. These requirements are refined to a level that is implementable and easy for software engineers to understand, thus supporting them in attending to existing regulations and standards. We have also performed a study on how two large open-source software projects (Google Chrome and Moodle) address the privacy requirements in our taxonomy by mining their issue reports. The paper discusses how the collected issues were classified and presents the findings and insights generated from our study.
Comment: Submitted to IEEE Transactions on Software Engineering on 23 December 202