“Embed, embed! There’s knocking at the gate.”

Abstract

The detection of intertextual references in text corpora is a digital humanities topic that has attracted considerable attention in recent years. While intertextuality – from a literary studies perspective – describes the phenomenon of one text being present in another, the computational problem at hand is text similarity detection, and more concretely, semantic similarity detection. In this notebook, we introduce the Vectorian as a framework for building queries with word embeddings such as fastText and GloVe. We evaluate the influence of computing document similarity through alignments such as Waterman-Smith-Beyer and two variants of Word Mover’s Distance. We also investigate the performance of state-of-the-art sentence embeddings such as Siamese BERT networks on this task – both as document embeddings and as contextual token embeddings. Overall, we find that Waterman-Smith-Beyer with fastText offers highly competitive performance. The notebook can also be used to upload new data and run custom search queries.
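To make the underlying idea concrete: at its simplest, embedding-based similarity search compares a query against candidate passages via the cosine similarity of (averaged) token vectors. The sketch below is purely illustrative and is not the Vectorian’s actual pipeline; the toy 3-dimensional vectors stand in for real fastText or GloVe embeddings, and mean pooling is the simplest possible document embedding.

```python
import math

# Hypothetical toy "embeddings" standing in for fastText/GloVe vectors.
toy_vectors = {
    "knock": [0.9, 0.1, 0.0],
    "knocking": [0.85, 0.2, 0.05],
    "gate": [0.1, 0.8, 0.3],
    "door": [0.15, 0.75, 0.35],
}

def cosine(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def doc_vector(tokens):
    # Mean of token vectors: the simplest document embedding.
    vecs = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return [sum(xs) / len(vecs) for xs in zip(*vecs)]

query = doc_vector(["knocking", "gate"])
candidate = doc_vector(["knock", "door"])
print(round(cosine(query, candidate), 3))  # high similarity, close to 1.0
```

Alignment-based measures such as Waterman-Smith-Beyer and Word Mover’s Distance refine this picture by matching tokens individually rather than collapsing a passage into a single averaged vector.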