Eye movements during reading offer insights into both the reader's cognitive
processes and the characteristics of the text that is being read. Hence, the
analysis of scanpaths in reading has attracted increasing attention across
fields, from cognitive science and linguistics to computer science. In
particular, eye-tracking-while-reading data has been argued to have the
potential to make machine-learning-based language models exhibit more
human-like linguistic behavior. However, one of the main challenges in modeling
human scanpaths in reading is their dual-sequence nature: the words are ordered
following the grammatical rules of the language, whereas the fixations are
chronologically ordered. As humans do not read strictly from left to right, but
rather skip or refixate words and regress to previous words, the alignment of
the linguistic and the temporal sequence is non-trivial. In this paper, we
develop Eyettention, the first dual-sequence model that simultaneously
processes the sequence of words and the chronological sequence of fixations.
The alignment of the two sequences is achieved by a cross-sequence attention
mechanism. We show that Eyettention outperforms state-of-the-art models in
predicting scanpaths. We provide an extensive within- and cross-dataset
evaluation on different languages. An ablation study and qualitative analysis
support an in-depth understanding of the model's behavior.
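
To make the dual-sequence idea concrete, the following is a minimal sketch of how a chronological fixation sequence could attend over a word sequence. It is not the Eyettention architecture: it assumes generic scaled dot-product attention, random toy vectors in place of learned encoders, and hypothetical names (`cross_sequence_attention`, `word_states`, `fixation_states`, `scanpath`) chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_sequence_attention(fixation_states, word_states):
    """Each fixation (query) attends over all words (keys and values).

    Assumption: plain scaled dot-product attention, used here as a
    stand-in for whatever cross-sequence mechanism a real model uses.
    Returns one context vector and one weight vector per fixation.
    """
    d = len(word_states[0])
    contexts, weights = [], []
    for q in fixation_states:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in word_states
        ]
        w = softmax(scores)
        ctx = [
            sum(w[j] * word_states[j][i] for j in range(len(word_states)))
            for i in range(d)
        ]
        contexts.append(ctx)
        weights.append(w)
    return contexts, weights


# Linguistic sequence: words in sentence order.
words = ["The", "old", "man", "the", "boats"]
# Temporal sequence: word index of each fixation in chronological order.
# Here the reader skips "old", regresses to it, refixates "man", skips "the".
scanpath = [0, 2, 1, 2, 4]

# Toy vectors standing in for learned word and fixation encodings.
rng = random.Random(0)
d = 8
word_states = [[rng.gauss(0, 1) for _ in range(d)] for _ in words]
fixation_states = [
    [x + 0.1 * rng.gauss(0, 1) for x in word_states[i]] for i in scanpath
]

contexts, weights = cross_sequence_attention(fixation_states, word_states)
```

Each row of `weights` is a distribution over the words for one fixation, so the two sequences can have different lengths and orders without requiring a one-to-one alignment.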