4 research outputs found
Explainable Authorship Verification in Social Media via Attention-based Similarity Learning
Authorship verification is the task of analyzing the linguistic patterns of
two or more texts to determine whether they were written by the same author or
not. The analysis is traditionally performed by experts who consider linguistic
features such as spelling mistakes, grammatical inconsistencies, and stylistic
traits. Machine learning algorithms, on the other hand, can be
trained to accomplish the same, but have traditionally relied on so-called
stylometric features. The disadvantage of such features is that their
reliability is greatly diminished for short and topically varied social media
texts. In this interdisciplinary work, we propose a substantial extension of a
recently published hierarchical Siamese neural network approach, with which it
is feasible to learn neural features and to visualize the decision-making
process. For this purpose, we compile a new large-scale corpus of short Amazon
reviews for text comparison research and show that the Siamese network
topologies outperform state-of-the-art approaches built on
stylometric features. Our linguistic analysis of the internal attention weights
of the network shows that the proposed method is indeed able to latch on to
some traditional linguistic categories.
Comment: Accepted for 2019 IEEE International Conference on Big Data (IEEE Big Data 2019)
Self-Calibrating Neural-Probabilistic Model for Authorship Verification Under Covariate Shift
We address two fundamental problems in authorship verification (AV):
topic variability and miscalibration. Variations in the topic of two disputed
texts are a major cause of error for most AV systems. In addition, it is
observed that the underlying probability estimates produced by deep learning AV
mechanisms oftentimes do not match the actual case counts in the respective
training data. As such, probability estimates are poorly calibrated. We
expand our framework from PAN 2020 to include Bayes factor scoring (BFS) and
an uncertainty adaptation layer (UAL) to address both problems. Experiments
with the 2020/21 PAN AV shared task data show that the proposed method
significantly reduces sensitivities to topical variations and significantly
improves the system's calibration.
Comment: 12th International Conference of the CLEF Association, 202
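As an illustration of what "poorly calibrated" means in the abstract above, the following minimal sketch computes a binned calibration error for same-author probabilities. It is not the authors' BFS or UAL method, and the abstract does not name a metric; the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are assumptions chosen for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned gap between predicted same-author probability and observed frequency.

    probs  : predicted probabilities in [0, 1] that a text pair is same-author
    labels : ground truth, 1 for same-author and 0 for different-author
    n_bins : number of equal-width probability bins (illustrative default)
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")

    # Assign every prediction to an equal-width bin; p == 1.0 goes to the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / total) * abs(mean_prob - observed)
    return ece


if __name__ == "__main__":
    # Hypothetical scores: the model says 0.9 but only 3 of 4 pairs are same-author.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A value near zero means the predicted probabilities agree with the observed same-author frequencies within each bin; larger values indicate over- or under-confidence.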
O2D2: Out-Of-Distribution Detector to Capture Undecidable Trials in Authorship Verification
The PAN 2021 authorship verification (AV) challenge is part of a three-year
strategy, moving from a cross-topic/closed-set AV task to a
cross-topic/open-set AV task over a collection of fanfiction texts. In this
work, we present a novel hybrid neural-probabilistic framework that is designed
to tackle the challenges of the 2021 task. Our system is based on our 2020
winning submission, with updates to significantly reduce sensitivities to
topical variations and to further improve the system's calibration by means of
an uncertainty-adaptation layer. Our framework additionally includes an
out-of-distribution detector (O2D2) for defining non-responses. Our proposed
system outperformed all other systems that participated in the PAN 2021 AV
task.
Comment: PAN@CLEF 202
Variational Autoencoder with Embedded Student-t Mixture Model for Authorship Attribution
Traditional computational authorship attribution describes a classification
task in a closed-set scenario. Given a finite set of candidate authors and
corresponding labeled texts, the objective is to determine which of the authors
has written another set of anonymous or disputed texts. In this work, we
propose a probabilistic autoencoding framework to deal with this supervised
classification task. More precisely, we extend a variational autoencoder
(VAE) with an embedded Gaussian mixture model to a Student-t mixture model.
Autoencoders have had tremendous success in learning latent representations.
However, existing VAEs are still bound by limitations imposed by the
assumed Gaussianity of the underlying probability distributions in the latent
space. In this work, we extend the Gaussian model for the VAE to a
Student-t model, which allows for independent control of the "heaviness"
of the respective tails of the implied probability densities. Experiments over
an Amazon review dataset indicate superior performance of the proposed method.
Comment: Preprint
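The "heaviness" of the tails mentioned in the abstract above can be seen by comparing a univariate Student-t density with a standard Gaussian density. This is a minimal sketch of the two textbook densities only, not the authors' VAE or mixture model; the function names and the chosen degrees of freedom are illustrative assumptions.

```python
import math


def gaussian_pdf(x):
    """Density of the standard normal distribution at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def student_t_pdf(x, nu):
    """Density of the standard Student-t distribution with nu degrees of freedom.

    Smaller nu gives heavier tails; as nu grows, the density approaches the Gaussian.
    """
    if nu <= 0:
        raise ValueError("nu must be positive")
    # Work in log space so the gamma functions do not overflow for large nu.
    log_norm = (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
    )
    return math.exp(log_norm - (nu + 1.0) / 2.0 * math.log1p(x * x / nu))


if __name__ == "__main__":
    # Compare the densities far from the mean for a few illustrative nu values.
    x = 5.0
    print("gaussian:", gaussian_pdf(x))
    for nu in (1.0, 3.0, 30.0):
        print("student-t, nu =", nu, ":", student_t_pdf(x, nu))
```

The degrees-of-freedom parameter `nu` is the knob that controls tail heaviness: at a point far from the mean, a small `nu` assigns much more density than the Gaussian does, which makes the model less sensitive to outlying latent codes.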