English → Russian MT evaluation campaign
This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of the generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce the human evaluation effort by using pairwise sentence comparisons by human judges to simulate sort operations.
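The idea of simulating sort operations with pairwise judgments can be shown in a short sketch. This is not the ROMIP 2013 setup; the judge is simulated here with hidden quality scores, and all names below are hypothetical. The point is only that a comparison-based sort needs O(n log n) pairwise human preferences per sentence rather than an absolute score for every candidate.

```python
# Minimal sketch (not the ROMIP 2013 implementation): ranking candidate
# translations with an ordinary comparison sort whose comparator is a
# pairwise preference judgment.
from functools import cmp_to_key

# Hypothetical stand-in for the human judge: a hidden quality score decides
# each comparison; a real campaign would query an annotation interface instead.
HIDDEN_QUALITY = {"sys_a": 0.62, "sys_b": 0.71, "sys_c": 0.55}

def judge_pair(trans_a: str, trans_b: str) -> int:
    """Return -1 if the first candidate is preferred, 1 if the second is, 0 for a tie."""
    qa, qb = HIDDEN_QUALITY[trans_a], HIDDEN_QUALITY[trans_b]
    return (qa < qb) - (qa > qb)   # prefer the higher-quality candidate

def rank_translations(candidates: list[str]) -> list[str]:
    """Order candidates from most to least preferred using only pairwise judgments."""
    return sorted(candidates, key=cmp_to_key(judge_pair))

print(rank_translations(["sys_a", "sys_b", "sys_c"]))  # ['sys_b', 'sys_a', 'sys_c']
```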
Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores for input responses, using a new dataset of human response scores. We show that ADEM's predictions correlate significantly with human judgements at both the utterance and system level, and at a level much higher than word-overlap metrics such as BLEU. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.
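The learned scoring approach can be illustrated with a small sketch. This is not the published ADEM architecture (which uses recurrent encoders trained on the collected human scores); it only shows, under assumed vector encodings of the dialogue context, the reference response, and the model response, how learned bilinear terms can map such a triple to a scalar, human-like score. All names, dimensions, and parameter values below are placeholders.

```python
# Minimal sketch, not the published ADEM model: a trainable scoring function
# in the same spirit, mapping encoded (context, reference, response) vectors
# to a scalar score via learned bilinear terms. Encoders, the training loop,
# and the human-score dataset are assumed to exist elsewhere.
import numpy as np

rng = np.random.default_rng(0)
dim = 128                                       # assumed embedding dimensionality
M = rng.normal(scale=0.01, size=(dim, dim))     # learned: context vs. response
N = rng.normal(scale=0.01, size=(dim, dim))     # learned: reference vs. response

def score(context_vec, reference_vec, response_vec, alpha=0.0, beta=1.0):
    """Predict a human-like quality score for a candidate response.

    The score rewards responses that are close, under learned projections,
    to both the dialogue context and the human reference response; alpha and
    beta rescale the raw value into the range of the human ratings.
    """
    raw = context_vec @ M @ response_vec + reference_vec @ N @ response_vec
    return (raw - alpha) / beta

# Example with random placeholder encodings.
c, r, r_hat = (rng.normal(size=dim) for _ in range(3))
print(score(c, r, r_hat))
```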
Human Errors in Decision Making
The aim of this paper was to identify human errors in the decision-making process. The study focused on the research question: which human errors can lead to decision failure during the evaluation of alternatives in the decision-making process? Two case studies were selected from the literature and analyzed to find the human errors that contribute to decision failure. The analysis of human errors was then linked with mental models in the evaluation-of-alternatives step. The results showed that five human errors occur in the evaluation of alternatives: ignorance or neglect, overconfidence, underestimation, moral, and failure to see, which led to objectives not being achieved. Keywords: decision-making process, human errors, mental models, decision failure.
How to Evaluate your Question Answering System Every Day and Still Get Real Work Done
In this paper, we report on Qaviar, an experimental automated evaluation system for question answering applications. The goal of our research was to find an automatically calculated measure that correlates well with human judges' assessment of answer correctness in the context of question answering tasks. Qaviar judges a response by computing recall against the stemmed content words in the human-generated answer key, and counts the answer as correct if it exceeds a given recall threshold. We determined that the answer correctness predicted by Qaviar agreed with the human judgements 93% to 95% of the time. Forty-one question-answering systems were ranked by both Qaviar and human assessors, and these rankings correlated with a Kendall's tau of 0.920, compared to a correlation of 0.956 between human assessors on the same data.
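The measure described above reduces to a few lines: stem the content words of the human answer key, compute the recall of those stems in the system response, and apply a threshold. The sketch below is an illustration of that description, not the Qaviar code; the NLTK Porter stemmer, the English stopword list, and the 0.5 threshold are assumptions.

```python
# Minimal sketch of stemmed content-word recall with a correctness threshold,
# in the spirit of the measure described above (not the actual Qaviar code).
# Assumes the NLTK stopword corpus has been downloaded; any stemmer and
# stopword list would do.
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

_stemmer = PorterStemmer()
_stop = set(stopwords.words("english"))

def content_stems(text: str) -> set[str]:
    """Lowercase, drop stopwords, and stem the remaining content words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {_stemmer.stem(t) for t in tokens if t not in _stop}

def answer_is_correct(response: str, answer_key: str, threshold: float = 0.5) -> bool:
    """Count the response correct if its recall of the answer key's
    stemmed content words meets the given threshold."""
    key = content_stems(answer_key)
    if not key:
        return False
    recall = len(key & content_stems(response)) / len(key)
    return recall >= threshold
```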
Rethinking the Agreement in Human Evaluation Tasks
Human evaluations are broadly thought to be more valuable the higher the inter-annotator agreement. In this paper we examine this idea. We describe our experiments and analysis within the area of Automatic Question Generation. Our experiments show how annotators diverge in language annotation tasks due to a range of ineliminable factors. For this reason, we believe that annotation schemes for natural language generation tasks that are aimed at evaluating language quality need to be treated with great care. In particular, an unchecked focus on reducing disagreement among annotators runs the danger of creating generation goals that reward output that is more distant from, rather than closer to, natural human-like language. We conclude the paper by suggesting a new approach to the use of agreement metrics in natural language generation evaluation tasks.
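To make the notion of inter-annotator agreement concrete, the sketch below computes Cohen's kappa, one common chance-corrected agreement statistic, for two annotators labelling the same items. It is illustrative only; the paper above does not prescribe this particular metric.

```python
# Illustrative only: Cohen's kappa, a chance-corrected measure of agreement
# between two annotators who label the same items.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:   # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Example: two annotators rating generated questions as good/bad.
print(cohens_kappa(["good", "good", "bad", "good"],
                   ["good", "bad", "bad", "good"]))   # 0.5
```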
