126 research outputs found
Determinants of grader agreement: an analysis of multiple short answer corpora
The 'short answer' question format is a widely used tool in educational assessment, in which students write one to three sentences in response to an open question. The answers are subsequently rated by expert graders. The agreement between these graders is crucial for reliable analysis, both in terms of educational strategies and in terms of developing automatic models for short answer grading (SAG), an active research topic in NLP. This makes it important to understand the properties that influence grader agreement (such as question difficulty, answer length, and answer correctness). However, the twin challenges towards such an understanding are the wide range of SAG corpora in use (which differ along a number of dimensions) and the hierarchical structure of potentially relevant properties (which can be located at the corpus, answer, or question levels). This article uses generalized mixed effects models to analyze the effect of various such properties on grader agreement in six major SAG corpora for two main assessment tasks (language and content assessment). Overall, we find broad agreement among corpora, with a number of properties behaving similarly across corpora (e.g., shorter answers and correct answers are easier to grade). Some properties show more corpus-specific behavior (e.g., the question difficulty level), and some corpora are more in line with general tendencies than others. In sum, we obtain a nuanced picture of how the major short answer grading corpora are similar and dissimilar, from which we derive suggestions for corpus development and analysis.
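Grader agreement in such settings is commonly quantified with chance-corrected coefficients. As a minimal illustration (the binary correctness ratings below are invented, not taken from any of the six corpora), Cohen's kappa for two graders can be computed like this:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two graders' label sequences."""
    n = len(r1)
    # Observed agreement: fraction of items both graders label identically
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement under independent grading with each grader's marginals
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[label] * c2[label] for label in set(r1) | set(r2)) / n**2
    return (po - pe) / (1 - pe)

# Hypothetical binary correctness ratings from two graders
g1 = [1, 1, 0, 1, 0, 0]
g2 = [1, 0, 0, 1, 0, 1]
print(round(cohens_kappa(g1, g2), 3))  # → 0.333
```

Mixed effects models as used in the article then ask how such agreement varies with item- and corpus-level predictors; the coefficient above is just the simplest building block.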
Investigating semantic subspaces of Transformer sentence embeddings through linear structural probing
The question of what kinds of linguistic information are encoded in different
layers of Transformer-based language models is of considerable interest for the
NLP community. Existing work, however, has overwhelmingly focused on word-level
representations and encoder-only language models with the masked-token training
objective. In this paper, we present experiments with semantic structural
probing, a method for studying sentence-level representations by finding a
subspace of the embedding space that provides suitable task-specific pairwise
distances between data-points. We apply our method to language models from
different families (encoder-only, decoder-only, encoder-decoder) and of
different sizes in the context of two tasks, semantic textual similarity and
natural-language inference. We find that model families differ substantially in
their performance and layer dynamics, but that the results are largely
model-size invariant. Comment: Accepted to BlackboxNLP 202
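The probing objective, finding a linear map B such that distances between projected embedding differences approximate task-specific target distances, can be sketched on toy data (all sizes, values, and the learning rate are hypothetical; a real probe is fit on actual model embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, rank = 15, 12, 3                 # hypothetical sizes

H = rng.normal(size=(n, dim))            # stand-in "sentence embeddings"
B_true = rng.normal(size=(rank, dim))    # subspace that generates the targets

def proj_dists(B, H):
    """Pairwise distances after projecting embedding differences through B."""
    diffs = H[:, None, :] - H[None, :, :]        # (n, n, dim)
    return np.linalg.norm(diffs @ B.T, axis=-1)  # (n, n)

D_gold = proj_dists(B_true, H)           # target task-specific distances

# Fit the probe B by gradient descent on the squared distance error
B = rng.normal(size=(rank, dim)) * 0.1
loss0 = np.mean((proj_dists(B, H) - D_gold) ** 2)
lr = 1e-3
for _ in range(1000):
    diffs = H[:, None, :] - H[None, :, :]
    proj = diffs @ B.T
    dist = np.linalg.norm(proj, axis=-1) + 1e-9
    err = dist - D_gold
    # dL/dB: for each pair, (err/dist) * (B d) d^T, summed then averaged
    grad = np.einsum('ijr,ijd->rd', (err / dist)[..., None] * proj, diffs)
    B -= lr * grad / (n * n)
loss1 = np.mean((proj_dists(B, H) - D_gold) ** 2)
```

Note that B is only identified up to rotation within the subspace; what the method cares about is how well some rank-limited subspace can reproduce the task distances at each layer.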
GermEval 2014 Named Entity Recognition Shared Task: Companion Paper
This paper describes the GermEval 2014 Named Entity Recognition (NER) Shared Task workshop at KONVENS. It provides background information on the motivation of this task, the data-set, the evaluation method, and an overview of the participating systems, followed by a discussion of their results. In contrast to previous NER tasks, the GermEval 2014 edition uses an extended tagset to account for derivatives of names and tokens that contain name parts. Further, nested named entities had to be predicted, i.e. names that contain other names. The eleven participating teams employed a wide range of techniques in their systems. The most successful systems used state-of-the-art machine learning methods, combined with some knowledge-based features in hybrid systems.
A distributional semantic study on German event nominalizations
We present the results of a large-scale corpus-based comparison of two German event nominalization patterns: deverbal nouns in -ung (e.g., die Evaluierung, 'the evaluation') and nominal infinitives (e.g., das Evaluieren, 'the evaluating'). Among the many available event nominalization patterns for German, we selected these two because they are both highly productive and challenging from the semantic point of view. Both patterns are known to keep a tight relation with the event denoted by the base verb, but with different nuances. Our study targets a better understanding of the differences in their semantic import.

The key notion of our comparison is that of semantic transparency, and we propose a usage-based characterization of the relationship between derived nominals and their bases. Using methods from distributional semantics, we bring to bear two concrete measures of transparency which highlight different nuances: the first one, cosine, detects nominalizations which are semantically similar to their bases; the second one, distributional inclusion, detects nominalizations which are used in a subset of the contexts of the base verb. We find that only the inclusion measure helps in characterizing the difference between the two types of nominalizations, in relation with the traditionally considered variable of relative frequency (Hay, 2001). Finally, the distributional analysis allows us to frame our comparison in the broader coordinates of the inflection vs. derivation cline.
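The two transparency measures can be illustrated on toy context-count vectors (the numbers are invented, and the inclusion variant below is one simple Weeds-precision-style formulation, not necessarily the exact measure used in the study):

```python
import numpy as np

def cosine(u, v):
    """Symmetric similarity: high when the derived noun and the base verb
    occur in similar contexts with similar weights."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def inclusion(u, v):
    """Asymmetric inclusion: share of u's context mass that falls in
    contexts where v also occurs (a Weeds-precision-style variant)."""
    return float(u[v > 0].sum() / u.sum())

# Hypothetical context counts over six shared contexts
verb = np.array([4., 3., 2., 0., 2., 0.])   # base verb, e.g. evaluieren
noml = np.array([5., 2., 0., 3., 0., 0.])   # nominalization, e.g. Evaluierung

print(round(cosine(noml, verb), 3))  # → 0.734
print(inclusion(noml, verb))         # → 0.7: 30% of the noun's usage lies
                                     #   outside the verb's contexts
```

Cosine treats the two vectors symmetrically, while inclusion is directional, which is exactly what makes it sensitive to nominalizations drifting into contexts the base verb does not share.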
Emotion Ratings: How Intensity, Annotation Confidence and Agreements are Entangled
When humans judge the affective content of texts, they also implicitly assess
the correctness of such judgment, that is, their confidence. We hypothesize
that people's (in)confidence that they performed well in an annotation task
leads to (dis)agreement with one another. If this is true, confidence may
serve as a diagnostic tool for systematic differences in annotations. To probe
our assumption, we conduct a study on a subset of the Corpus of Contemporary
American English, in which we ask raters to distinguish neutral sentences from
emotion-bearing ones, while scoring the confidence of their answers. Confidence
turns out to approximate inter-annotator disagreements. Further, we find that
confidence is correlated with emotion intensity: perceiving stronger affect in
a text prompts annotators to classify with more certainty. This
insight is relevant for modelling studies of intensity, as it opens the
question whether automatic regressors or classifiers actually predict intensity,
or rather humans' self-perceived confidence. Comment: WASSA 2021 at EACL 2021
Political claim identification and categorization in a multilingual setting: First experiments
The identification and classification of political claims is an important
step in the analysis of political newspaper reports; however, resources for
this task are few and far between. This paper explores different strategies for
the cross-lingual projection of political claims analysis. We conduct
experiments on a German dataset, DebateNet2.0, covering the policy debate
sparked by the 2015 refugee crisis. Our evaluation involves two tasks (claim
identification and categorization), three languages (German, English, and
French) and two methods (machine translation -- the best method in our
experiments -- and multilingual embeddings). Comment: Presented at KONVENS 2023, Ingolstadt, Germany
Constraining Linear-chain CRFs to Regular Languages
A major challenge in structured prediction is to represent the
interdependencies within output structures. When outputs are structured as
sequences, linear-chain conditional random fields (CRFs) are a widely used
model class which can learn local dependencies in the output. However,
the CRF's Markov assumption makes it impossible for CRFs to represent
distributions with nonlocal dependencies, and standard CRFs are unable
to respect nonlocal constraints of the data (such as global arity constraints
on output labels). We present a generalization of CRFs that can enforce a broad
class of constraints, including nonlocal ones, by specifying the space of
possible output structures as a regular language L. The resulting
regular-constrained CRF (RegCCRF) has the same formal properties as a standard
CRF, but assigns zero probability to all label sequences not in L.
Notably, RegCCRFs can incorporate their constraints during training, while
related models only enforce constraints during decoding. We prove that
constrained training is never worse than constrained decoding, and show
empirically that it can be substantially better in practice. Additionally, we
demonstrate a practical benefit on downstream tasks by incorporating a RegCCRF
into a deep neural model for semantic role labeling, exceeding state-of-the-art
results on a standard dataset.
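The core idea, restricting a sequence model's support to a regular language and renormalizing, can be shown by brute force on a tiny label set (the scores and the BIO-style language are invented; a real RegCCRF uses dynamic programming over an intersection automaton, not enumeration):

```python
import itertools
import math
import re

# Hypothetical per-position label scores for a length-3 input
# (a real CRF would add transition scores; emissions suffice here).
emit = [{"O": 0.1, "B": 1.0, "I": 0.2},
        {"O": 0.3, "B": 0.2, "I": 1.5},
        {"O": 0.8, "B": 0.1, "I": 0.4}]

# Regular language: an I may only continue a span opened by a B
LANG = re.compile(r"(O|BI*)*")

def score(y):
    return sum(emit[t][label] for t, label in enumerate(y))

seqs = ["".join(y) for y in itertools.product("OBI", repeat=3)]
valid = [y for y in seqs if LANG.fullmatch(y)]

# Renormalize over sequences in the language only: invalid ones get p = 0
Z = sum(math.exp(score(y)) for y in valid)
p = {y: math.exp(score(y)) / Z for y in valid}

best = max(p, key=p.get)
print(best)          # → BIO
print("OII" in p)    # → False: an I without a preceding B is excluded
```

Because the renormalization happens inside the model, the constraint shapes the training objective itself, which is the sense in which constrained training can beat merely filtering outputs at decoding time.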
Approximate Attributions for Off-the-Shelf Siamese Transformers
Siamese encoders such as sentence transformers are among the least understood
deep models. Established attribution methods cannot tackle this model class
since it compares two inputs rather than processing a single one. To address
this gap, we have recently proposed an attribution method specifically for
Siamese encoders (Möller et al., 2023). However, it requires models to be
adjusted and fine-tuned and therefore cannot be directly applied to
off-the-shelf models. In this work, we reassess these restrictions and propose
(i) a model with exact attribution ability that retains the original model's
predictive performance and (ii) a way to compute approximate attributions for
off-the-shelf models. We extensively compare approximate and exact attributions
and use them to analyze the models' attention to different linguistic aspects.
We gain insights into which syntactic roles Siamese transformers attend to,
confirm that they mostly ignore negation, explore how they judge semantically
opposite adjectives, and find that they exhibit lexical bias. Comment: Accepted for EACL 2024, St. Julian's, Malta
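For intuition on what a pairwise attribution assigns, here is a generic occlusion baseline on a toy bag-of-vectors "encoder" (this is not the integrated-attribution method of Möller et al.; every vector and token below is hypothetical): drop one token at a time from the first input and record how the cosine similarity of the pair changes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in encoder: mean of (hypothetical) token vectors
vocab = {w: rng.normal(size=8) for w in
         ["the", "cat", "sleeps", "a", "dog", "rests"]}

def encode(tokens):
    return np.mean([vocab[t] for t in tokens], axis=0)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def occlusion_attributions(tokens_a, tokens_b):
    """Similarity drop when each token of input A is removed."""
    base = cos(encode(tokens_a), encode(tokens_b))
    return {t: base - cos(encode([x for x in tokens_a if x != t]),
                          encode(tokens_b))
            for t in tokens_a}

attr = occlusion_attributions(["the", "cat", "sleeps"],
                              ["a", "dog", "rests"])
```

Unlike single-input attribution, each score here is inherently relative to the second input, which is the property that makes Siamese models hard for established attribution methods.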