In recent years, gaze data has been increasingly used to improve and evaluate NLP
models because it carries information about the cognitive processing
of linguistic phenomena. In this paper, we
conduct a preliminary study towards the
automatic identification of multiword expressions based on gaze features from native and non-native speakers of English.
We compare a part-of-speech (POS) and frequency baseline against:
i) a prediction model based solely on gaze
data and ii) a combined model of gaze
data, POS and frequency. Despite the
challenging nature of the task, the combined model achieved the best performance. Furthermore, we explore how the type of gaze
data (from native versus non-native speakers) affects the prediction, showing that
data from the two groups is discriminative
to an equal degree. Finally, we show that
late processing measures are more predictive than early ones, which is in line with
previous research on idioms and other formulaic structures.