5 research outputs found
PoTeC: A German Naturalistic Eye-tracking-while-reading Corpus
The Potsdam Textbook Corpus (PoTeC) is a naturalistic
eye-tracking-while-reading corpus containing data from 75 participants reading
12 scientific texts. PoTeC is the first naturalistic eye-tracking-while-reading
corpus that contains eye-movements from domain-experts as well as novices in a
within-participant manipulation: It is based on a 2x2x2 fully-crossed factorial
design which includes the participants' level of study and the participants'
discipline of study as between-subject factors and the text domain as a
within-subject factor. The participants' reading comprehension was assessed by
a series of text comprehension questions and their domain knowledge was tested
by text-independent background questions for each of the texts. The materials
are annotated for a variety of linguistic features at different levels. We
envision PoTeC to be used for a wide range of studies including but not limited
to analyses of expert and non-expert reading strategies. The corpus and all the
accompanying data at all stages of the preprocessing pipeline and all code used
to preprocess the data are made available via GitHub:
https://github.com/DiLi-Lab/PoTeC
ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts
Eye movements in reading play a crucial role in psycholinguistic research studying the cognitive mechanisms underlying human language processing. More recently, the tight coupling between eye movements and cognition has also been leveraged for language-related machine learning tasks such as the interpretability, enhancement, and pre-training of language models, as well as the inference of reader- and text-specific properties. However, scarcity of eye movement data and its unavailability at application time poses a major challenge for this line of research. Initially, this problem was tackled by resorting to cognitive models for synthesizing eye movement data. However, for the sole purpose of generating human-like scanpaths, purely data-driven machine-learning-based methods have proven to be more suitable. Following recent advances in adapting diffusion processes to discrete data, we propose ScanDL, a novel discrete sequence-to-sequence diffusion model that generates synthetic scanpaths on texts. By leveraging pre-trained word representations and jointly embedding both the stimulus text and the fixation sequence, our model captures multi-modal interactions between the two inputs. We evaluate ScanDL within- and across-dataset and demonstrate that it significantly outperforms state-of-the-art scanpath generation methods. Finally, we provide an extensive psycholinguistic analysis that underlines the model's ability to exhibit human-like reading behavior. Our implementation is made available at https://github.com/DiLi-Lab/ScanDL
ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts
Eye movements in reading play a crucial role in psycholinguistic research
studying the cognitive mechanisms underlying human language processing. More
recently, the tight coupling between eye movements and cognition has also been
leveraged for language-related machine learning tasks such as the
interpretability, enhancement, and pre-training of language models, as well as
the inference of reader- and text-specific properties. However, scarcity of eye
movement data and its unavailability at application time poses a major
challenge for this line of research. Initially, this problem was tackled by
resorting to cognitive models for synthesizing eye movement data. However, for
the sole purpose of generating human-like scanpaths, purely data-driven
machine-learning-based methods have proven to be more suitable. Following
recent advances in adapting diffusion processes to discrete data, we propose
ScanDL, a novel discrete sequence-to-sequence diffusion model that generates
synthetic scanpaths on texts. By leveraging pre-trained word representations
and jointly embedding both the stimulus text and the fixation sequence, our
model captures multi-modal interactions between the two inputs. We evaluate
ScanDL within- and across-dataset and demonstrate that it significantly
outperforms state-of-the-art scanpath generation methods. Finally, we provide
an extensive psycholinguistic analysis that underlines the model's ability to
exhibit human-like reading behavior. Our implementation is made available at
https://github.com/DiLi-Lab/ScanDL.Comment: EMNLP 202