23 research outputs found
Pre-trained language models evaluating themselves - A comparative study
Evaluating generated text has received new attention with the introduction of model-based metrics in recent years. These new metrics correlate more highly with human judgments and seemingly overcome many issues of earlier n-gram-based metrics from the symbolic age. In this work, we examine the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate (Petersen). We investigate their sensitivity to different types of semantic deterioration (part-of-speech drop and negation), word order perturbations, word drop, and the common problem of repetition. No metric showed appropriate behaviour for negation, and none of them was consistently sensitive to the other issues mentioned above.
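The sensitivity analysis described above relies on applying controlled perturbations to candidate texts and checking whether a metric's score drops. A minimal sketch of such perturbation functions is shown below; the function name and the exact perturbation choices are illustrative assumptions, not the authors' exact protocol.

```python
import random

def perturb(tokens, kind, rng=None):
    """Apply a simple text perturbation of the given kind.

    kinds: 'word_drop'  removes one random token,
           'word_order' swaps two adjacent tokens,
           'repetition' duplicates one random token.
    """
    rng = rng or random.Random(0)
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    i = rng.randrange(len(tokens) - 1)
    if kind == "word_drop":
        del tokens[i]
    elif kind == "word_order":
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    elif kind == "repetition":
        tokens.insert(i, tokens[i])
    return tokens

sent = "the metric should detect this change".split()
perturbed = perturb(sent, "word_order")
```

A metric would then be scored on (reference, perturbed) pairs; a sensitive metric assigns the perturbed candidate a lower score than the unperturbed one.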
CC-Top: Constrained Clustering for Dynamic Topic Discovery
Research on multi-class text classification of short texts mainly focuses on supervised (transfer) learning approaches, requiring a finite set of pre-defined classes that is constant over time. This work explores deep constrained clustering (CC) as an alternative to supervised learning approaches in a setting with a dynamically changing number of classes, a task we introduce as dynamic topic discovery (DTD). We do so by using pairwise similarity constraints instead of instance-level class labels, which allow for a flexible number of classes while exhibiting competitive performance compared to supervised approaches. First, we substantiate this through a series of experiments and show that CC algorithms exhibit predictive performance similar to state-of-the-art supervised learning algorithms while requiring less annotation effort. Second, we demonstrate the overclustering capabilities of deep CC for detecting topics in short text data sets in the absence of the ground-truth class cardinality during model training. Third, we showcase how these capabilities can be leveraged for the DTD setting as a step towards dynamic learning over time. Finally, we release our codebase to nurture further research in this area.
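The pairwise similarity constraints mentioned above are typically turned into a training signal by comparing soft cluster assignments of constrained pairs. The following sketch shows one common formulation (a binary cross-entropy on the inner product of assignment distributions); this is an assumed illustration of the general technique, not necessarily the loss used in this paper.

```python
import numpy as np

def pairwise_constraint_loss(p, pairs, labels):
    """p: (n, k) soft cluster assignments (rows sum to 1).
    pairs: list of (i, j) index pairs into p.
    labels: 1 for must-link pairs, 0 for cannot-link pairs.

    The inner product p[i] @ p[j] is the probability that both
    instances land in the same cluster; must-link pairs push it
    toward 1, cannot-link pairs toward 0.
    """
    eps = 1e-12
    loss = 0.0
    for (i, j), y in zip(pairs, labels):
        agree = float(p[i] @ p[j])
        loss += -y * np.log(agree + eps) - (1 - y) * np.log(1 - agree + eps)
    return loss / len(pairs)
```

Note that the loss never references class identities, only whether two instances belong together, which is what makes a flexible (and even over-specified) number of clusters possible.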
Exposure-lag-response associations between lung cancer mortality and radon exposure in German uranium miners.
Exposure-lag-response associations shed light on the duration of pathogenesis for radiation-induced diseases. To investigate such relations for lung cancer mortality in the German uranium miners of the Wismut company, we apply distributed lag non-linear models (DLNMs), which offer a flexible description of the lagged risk response to protracted radon exposure. Exposure-lag functions are implemented with B-splines in Cox models of proportional hazards. The DLNM approach yielded good agreement of exposure-lag-response surfaces for the German cohort and for the previously studied cohort of American Colorado miners. For both cohorts, a minimum lag of about 2 years for the onset of risk after first exposure explained the data well, though possibly with large uncertainty. Risk estimates from DLNMs were directly compared with estimates from both standard radio-epidemiological models and biologically based mechanistic models. For age > 45 years, all models predict decreasing estimates of the Excess Relative Risk (ERR). However, at younger ages, marked differences appear as DLNMs exhibit ERR peaks, which are not detected by the other models. After comparing exposure-responses for biological processes in mechanistic risk models with exposure-responses for hazard ratios in DLNMs, we propose a typical period of 15 years for radon-related lung carcinogenesis. The period covers the onset of radiation-induced inflammation of lung tissue until cancer death. The DLNM framework provides a view on age-risk patterns supplemental to the standard radio-epidemiological approach and to biologically based modeling.
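The core idea of a distributed-lag model is that the risk predictor at a given time is a weighted sum of all past exposures, with weights given by an exposure-lag function. A toy illustration with fixed, hypothetical lag weights and a 2-year minimum lag follows; in a real DLNM the weights would be B-spline functions estimated inside the Cox model, which this sketch does not attempt.

```python
import numpy as np

def lagged_risk(exposures, lag_weights, min_lag=2):
    """exposures[t]: exposure accrued in year t.
    lag_weights[l]: weight of an exposure l years in the past
    (hypothetical values here; estimated in a real DLNM).
    Contributions at lags below min_lag are zero, reflecting a
    minimum latency before exposure affects risk.
    Returns the lag-weighted risk predictor at each year."""
    n = len(exposures)
    risk = np.zeros(n)
    for t in range(n):
        for l in range(min_lag, min(t + 1, len(lag_weights))):
            risk[t] += exposures[t - l] * lag_weights[l]
    return risk
```

With a single pulse of exposure in year 0, the predictor stays zero until the minimum lag has elapsed and then follows the shape of the lag function.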
ActiveGLAE: A Benchmark for Deep Active Learning with Transformers
Deep active learning (DAL) seeks to reduce annotation costs by enabling the model to actively query instance annotations from which it expects to learn the most. Despite extensive research, there is currently no standardized evaluation protocol for transformer-based language models in the field of DAL. Diverse experimental settings lead to difficulties in comparing research and deriving recommendations for practitioners. To tackle this challenge, we propose the ActiveGLAE benchmark, a comprehensive collection of data sets and evaluation guidelines for assessing DAL. Our benchmark aims to facilitate and streamline the evaluation process of novel DAL strategies. Additionally, we provide an extensive overview of current practice in DAL with transformer-based language models. We identify three key challenges - data set selection, model training, and DAL settings - that pose difficulties in comparing query strategies. We establish baseline results through an extensive set of experiments as a reference point for evaluating future work. Based on our findings, we provide guidelines for researchers and practitioners.
Comment: Accepted @ ECML PKDD 2023. This is the author's version of the work. The definitive Version of Record will be published in the Proceedings of ECML PKDD 2023.
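The query strategies evaluated in a DAL benchmark typically score unlabeled instances by model uncertainty and request annotations for the most uncertain ones. A minimal sketch of one widely used strategy, entropy-based uncertainty sampling, is shown below; it is a generic illustration, not a strategy specific to ActiveGLAE.

```python
import numpy as np

def entropy_query(probs, batch_size):
    """probs: (n, c) predicted class probabilities for the
    unlabeled pool. Returns indices of the batch_size instances
    with the highest predictive entropy, i.e. those the model is
    least certain about and would query for annotation."""
    eps = 1e-12
    ent = -(probs * np.log(probs + eps)).sum(axis=1)
    return np.argsort(-ent)[:batch_size]
```

In a full DAL loop this selection step alternates with retraining the model on the growing labeled set, which is exactly where the benchmark's "model training" and "DAL settings" challenges arise.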
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. Their ever-increasing size, however, has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach for assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs focus on token divergences, which allow deeper insights into the subtleties of model compression, in particular when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that a quarter of all attention components can be pruned beyond 90% on the Llama-2 model family while still keeping SOTA performance. For quantization, FDTM suggests that over 80% of parameters can naively be transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually, and that FDTM can identify those, while standard metrics result in deteriorated outcomes.
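A plausible reading of the First Divergent Token Metric is the position at which the compressed model's greedy generation first departs from the reference model's; the later the divergence, the better the compression preserved generation quality. The sketch below encodes that reading and is an assumption about the metric's core, not the paper's exact definition.

```python
def first_divergent_token(tokens_ref, tokens_comp):
    """tokens_ref: greedy output tokens of the reference model.
    tokens_comp: greedy output tokens of the compressed model.
    Returns the index of the first position where they diverge,
    or the length of their overlap if they agree throughout."""
    for i, (a, b) in enumerate(zip(tokens_ref, tokens_comp)):
        if a != b:
            return i
    return min(len(tokens_ref), len(tokens_comp))
```

Unlike perplexity, which averages over the whole sequence, this position-based view pinpoints where generation actually starts to degrade, which is what makes per-component comparisons possible.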
How Different Is Stereotypical Bias Across Languages?
Recent studies have demonstrated how to assess the stereotypical bias in pre-trained English language models. In this work, we extend this branch of research in multiple different dimensions by systematically investigating (a) mono- and multilingual models of (b) different underlying architectures with respect to their bias in (c) multiple different languages. To that end, we make use of the English StereoSet data set (Nadeem et al., 2021), which we semi-automatically translate into German, French, Spanish, and Turkish. We find that it is of major importance to conduct this type of analysis in a multilingual setting, as our experiments show a much more nuanced picture as well as notable differences from the English-only analysis. The main takeaways from our analysis are that mGPT-2 (partly) shows surprising anti-stereotypical behavior across languages, English (monolingual) models exhibit the strongest bias, and the stereotypes reflected in the data set are least present in Turkish models. Finally, we release our codebase alongside the translated data sets and practical guidelines for the semi-automatic translation to encourage further extension of our work to other languages.
Comment: Accepted @ "3rd Workshop on Bias and Fairness in AI" (co-located with ECML PKDD 2023). This is the author's version of the work. The definitive version of record will be published in the proceedings.
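StereoSet-style bias assessments typically compare a model's preference between a stereotypical and an anti-stereotypical completion of the same context. The following sketch computes such a preference rate from per-example scores; the exact aggregation used in the paper may differ, so treat this as a generic illustration of the technique.

```python
def stereotype_score(scores):
    """scores: list of (stereo_logprob, anti_logprob) pairs, one
    per StereoSet-style example, where each entry is the model's
    log-probability for the stereotypical vs. anti-stereotypical
    completion. Returns the fraction of examples where the model
    prefers the stereotypical one; 0.5 indicates no preference,
    values below 0.5 indicate anti-stereotypical behavior (as
    reported for mGPT-2 above)."""
    prefer = sum(1 for stereo, anti in scores if stereo > anti)
    return prefer / len(scores)
```

Running the same aggregation over translated versions of the data set is what enables the cross-lingual comparison described in the abstract.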