Authors imprint identifying information in the documents they write:
vocabulary, register, punctuation, misspellings, or even emoji usage. Detecting
these details is highly relevant for profiling authors, linking texts back to
attributes such as gender, occupation, or age. Most importantly, recurring
writing patterns can help attribute authorship to a text. Previous work trains
authorship models with hand-crafted features or classification objectives,
leading to poor performance on out-of-domain authors. A better approach is to
learn stylometric representations, but this is itself an open research
challenge. In this paper, we propose PART: a contrastively trained
model designed to learn \textbf{authorship embeddings} instead of semantics. By
comparing pairs of documents written by the same author, we can determine the
authorship of a text by evaluating the cosine similarity between document
embeddings, a zero-shot generalization to authorship identification.
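As a rough illustration of this zero-shot scheme (a minimal sketch under our
own assumptions; \texttt{encode} is a hypothetical stand-in for the trained
encoder, not the paper's implementation):

\begin{verbatim}
# Zero-shot attribution by cosine similarity. `encode` is a toy
# character-count featurizer standing in for the trained encoder,
# so the script runs end to end.
import numpy as np

def encode(text: str) -> np.ndarray:
    vec = np.zeros(64)
    for ch in text:
        vec[ord(ch) % 64] += 1.0  # toy "style" features
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# One reference document per known author; the query is attributed
# to the author whose reference embedding is most similar.
references = {
    "author_a": encode("Call me Ishmael. Some years ago..."),
    "author_b": encode("It was a pleasure to burn."),
}
query = encode("Call me Ishmael.")
print(max(references, key=lambda name: cosine(references[name], query)))
\end{verbatim}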
To this end, a pre-trained Transformer with an LSTM head is trained with a
contrastive objective. We train our model on a diverse set of authors spanning
literature, anonymous blog posts, and corporate emails: a heterogeneous corpus
with distinct and identifiable writing styles.
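The sketch below illustrates such an architecture, assuming PyTorch; the layer
sizes and the generic \texttt{nn.TransformerEncoder} standing in for a specific
pre-trained checkpoint are our own assumptions, not the paper's configuration:

\begin{verbatim}
# A pre-trained Transformer followed by an LSTM head that emits one
# authorship embedding per document (illustrative sizes).
import torch
import torch.nn as nn

class AuthorshipEncoder(nn.Module):
    def __init__(self, hidden: int = 768, embed: int = 256):
        super().__init__()
        # Stand-in for a pre-trained Transformer; a real setup would
        # load an existing checkpoint here instead.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.lstm = nn.LSTM(hidden, embed, batch_first=True)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden)
        h = self.transformer(token_states)
        _, (h_n, _) = self.lstm(h)
        return h_n[-1]  # (batch, embed) authorship embedding

x = torch.randn(4, 32, 768)          # toy batch of token representations
print(AuthorshipEncoder()(x).shape)  # torch.Size([4, 256])
\end{verbatim}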
The model is evaluated on these datasets, achieving zero-shot top-1 and top-5
accuracy of 72.39\% and 86.73\%, respectively, on the joint evaluation dataset
when determining authorship among a set of 250 different authors. We
qualitatively assess the representations with several data visualizations on
the available datasets, profiling features such as book type, gender, age, or
occupation of the author.
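To make the reported metrics concrete, the following sketch shows one standard
way top-1 and top-5 attribution accuracy can be computed over a closed
candidate set; the data is synthetic and the setup is our assumption, not the
paper's evaluation script:

\begin{verbatim}
# Top-1 / top-5 attribution accuracy over a closed set of authors.
import numpy as np

rng = np.random.default_rng(0)
n_authors, dim = 250, 256
refs = rng.normal(size=(n_authors, dim))      # one embedding per author
queries = refs + 0.5 * rng.normal(size=refs.shape)
labels = np.arange(n_authors)                 # query i belongs to author i

# Cosine similarity between every query and every reference author.
q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
ranked = np.argsort(-(q @ r.T), axis=1)       # authors sorted by similarity

top1 = np.mean(ranked[:, 0] == labels)
top5 = np.mean([labels[i] in ranked[i, :5] for i in range(n_authors)])
print(f"top-1: {top1:.2%}, top-5: {top5:.2%}")
\end{verbatim}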