274 research outputs found
Character-level Transformer-based Neural Machine Translation
Neural machine translation (NMT) is nowadays commonly applied at the subword
level, using byte-pair encoding. A promising alternative approach focuses on
character-level translation, which simplifies processing pipelines in NMT
considerably. This approach, however, must consider relatively longer
sequences, rendering the training process prohibitively expensive. In this
paper, we discuss a novel Transformer-based approach that we compare, both in
speed and in quality, to the Transformer at subword and character levels, as
well as to previously developed character-level models. We evaluate our models on
4 language pairs from WMT'15: DE-EN, CS-EN, FI-EN and RU-EN. The proposed novel
architecture can be trained on a single GPU and is 34% faster than the
character-level Transformer; still, the obtained results are at least on par
with it. In addition, our proposed model outperforms the subword-level model in
FI-EN and shows close results in CS-EN. To stimulate further research in this
area and close the gap with subword-level NMT, we make all our code and models
publicly available.
On the Feasibility of Automated Detection of Allusive Text Reuse
The detection of allusive text reuse is particularly challenging due to the
sparse evidence on which allusive references rely, commonly no or
very few shared words. Arguably, lexical semantics can be resorted to since
uncovering semantic relations between words has the potential to increase the
support underlying the allusion and alleviate the lexical sparsity. A further
obstacle is the lack of evaluation benchmark corpora, largely due to the highly
interpretative character of the annotation process. In the present paper, we
aim to elucidate the feasibility of automated allusion detection. We approach
the matter from an Information Retrieval perspective in which referencing texts
act as queries and referenced texts as relevant documents to be retrieved, and
estimate the difficulty of benchmark corpus compilation by a novel
inter-annotator agreement study on query segmentation. Furthermore, we
investigate to what extent the integration of lexical semantic information
derived from distributional models and ontologies can aid retrieving cases of
allusive reuse. The results show that (i) despite low agreement scores, using
manual queries considerably improves retrieval performance with respect to a
windowing approach, and that (ii) retrieval performance can be moderately
boosted with distributional semantics.
DHBeNeLux : incubator for digital humanities in Belgium, the Netherlands and Luxembourg
Digital Humanities BeNeLux is a grassroots initiative to foster knowledge networking and dissemination in digital humanities in Belgium, the Netherlands, and Luxembourg. This special issue highlights a selection of the work presented at the DHBenelux 2015 Conference, as an anthology of the digital humanities research currently being done in the Benelux area and beyond. The introduction describes why this grassroots initiative came about, how DHBenelux currently supports community building and knowledge exchange for digital humanities in the Benelux area, and how this integrates regional digital humanities into the larger international digital humanities environment.
A challenge for stylometry and authorship attribution methods: Goethe's contributions to the Frankfurter Gelehrte Anzeigen 1772/73
Collaborative authorship in the twelfth century: a stylometric study of Hildegard of Bingen and Guibert of Gembloux
Hildegard of Bingen (1098–1179) is one of the most influential female authors of the Middle Ages. From the point of view of computational stylistics, the oeuvre attributed to Hildegard is fascinating. Hildegard dictated her texts to secretaries in Latin, a language whose grammatical subtleties she had not fully mastered. She therefore allowed her scribes to correct her spelling and grammar. Hildegard’s last collaborator in particular, Guibert of Gembloux, seems to have considerably reworked her works during his secretaryship. Whereas her other scribes were only allowed to make superficial linguistic changes, Hildegard would have permitted Guibert to render her language stylistically more elegant. In this article, we focus on two shorter texts: the Visio ad Guibertum missa and Visio de sancto Martino, both of which Hildegard allegedly authored during Guibert’s secretaryship. We analyse a corpus containing the letter collections of Hildegard, Guibert and Bernard of Clairvaux using a number of common stylometric techniques. We discuss our results in the light of the Synergy Hypothesis, suggesting that texts resulting from collaboration can display a style markedly different from that of the collaborating authors. Finally, we demonstrate that Guibert must have reworked the disputed visionary texts allegedly authored by Hildegard to such an extent that style-oriented computational procedures attribute the texts to Guibert.
A computational approach to authorship verification of Johann Wolfgang Goethe’s contributions to the Frankfurter gelehrte Anzeigen (1772-73)
Assessing the stylistic properties of neurally generated text in authorship attribution
Recent applications of neural language models have led to an increased
interest in the automatic generation of natural language. However impressive
the output, the evaluation of neurally generated text has so far remained rather informal
and anecdotal. Here, we present an attempt at the systematic assessment of one
aspect of the quality of neurally generated text. We focus on a specific aspect
of neural language generation: its ability to reproduce authorial writing
styles. Using established models for authorship attribution, we empirically
assess the stylistic qualities of neurally generated text. In comparison to
conventional language models, neural models generate fuzzier text that is
relatively harder to attribute correctly. Nevertheless, our results also
suggest that neurally generated text offers more valuable perspectives for the
augmentation of training data.
Warren, Michelle R. 2022. Holy Digital Grail. A Medieval Book on the Internet. Stanford: Stanford University Press. Pp. xiii + 342. ISBN 9781503608009.
Book review of Warren, Michelle R. 2022. Holy Digital Grail. A Medieval Book on the Internet. Stanford: Stanford University Press. Pp. xiii + 342. ISBN 9781503608009
