23,337 research outputs found
A Framework for Evaluation of Machine Reading Comprehension Gold Standards
Machine Reading Comprehension (MRC) is the task of answering a question over
a paragraph of text. While neural MRC systems gain popularity and achieve
noticeable performance, issues are being raised with the methodology used to
establish their performance, particularly concerning the data design of gold
standards that are used to evaluate them. There is but a limited understanding
of the challenges present in this data, which makes it hard to draw comparisons
and formulate reliable hypotheses. As a first step towards alleviating the
problem, this paper proposes a unifying framework to systematically investigate
the present linguistic features, required reasoning and background knowledge
and factual correctness on one hand, and the presence of lexical cues as a
lower bound for the requirement of understanding on the other hand. We propose
a qualitative annotation schema for the first and a set of approximative
metrics for the latter. In a first application of the framework, we analyse
modern MRC gold standards and present our findings: the absence of features
that contribute towards lexical ambiguity, the varying factual correctness of
the expected answers and the presence of lexical cues, all of which potentially
lower the reading comprehension complexity and quality of the evaluation data.Comment: In Proceedings of the 12th International Conference on Language
Resources and Evaluation (LREC 2020
Eye tracking as an MT evaluation technique
Eye tracking has been used successfully as a technique for measuring cognitive load in reading, psycholinguistics, writing, language acquisition etc. for some time now. Its application as a technique for measuring the reading ease of MT output has not yet, to our knowledge, been tested. We report here on a preliminary study testing the use and validity of an eye tracking methodology as a means of semi-automatically evaluating machine translation output. 50 French machine translated sentences, 25 rated as excellent and 25 rated as poor in an earlier human evaluation, were selected. Ten native speakers of French were instructed to read the MT sentences for comprehensibility. Their eye gaze data were recorded non-invasively using a Tobii 1750 eye tracker. The average gaze time and fixation count were found to be higher for the “bad” sentences, while average fixation duration and pupil dilations were not found to be substantially different for output rated as good and output rated as bad. Comparisons between HTER scores and eye gaze data were also found to correlate well with gaze time and fixation count, but not with pupil dilation and fixation duration. We conclude that the eye tracking data, in particular gaze time and fixation count, correlate reasonably well with human evaluation of MT output but fixation duration and pupil dilation may be less reliable indicators of reading difficulty for MT output. We also conclude that eye tracking has promise as a semi-automatic MT evaluation technique, which does not require bi-lingual knowledge, and which can potentially tap into the end users’ experience of machine translation output
- …