
    English → Russian MT evaluation campaign

    This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation effort by using pairwise sentence comparisons by human judges to simulate sort operations.
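
    The sort-simulation idea can be illustrated with a small sketch: a comparator standing in for a human judge drives an ordinary comparison sort, so ranking n candidate translations needs on the order of n log n pairwise judgments rather than a separate score for every candidate. The judge function and its length heuristic below are placeholders for illustration only, not the ROMIP 2013 protocol.

```python
# Illustrative sketch: a comparison sort driven by pairwise judgments.
# human_judge() is a placeholder for asking a human "which translation is
# better?"; a length heuristic stands in so the example runs.
from functools import cmp_to_key

def human_judge(translation_a: str, translation_b: str) -> int:
    # A real judge would return -1 if translation_a reads better, 1 if
    # translation_b does, and 0 for a tie. This stand-in prefers the shorter string.
    return (len(translation_a) > len(translation_b)) - (len(translation_a) < len(translation_b))

def rank_translations(candidates: list[str]) -> list[str]:
    # sorted() makes O(n log n) comparator calls, i.e. O(n log n) human
    # judgments to rank n candidate translations.
    return sorted(candidates, key=cmp_to_key(human_judge))

print(rank_translations(["a long candidate translation", "short one", "medium candidate"]))
```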

    Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

    Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores for input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation. Comment: ACL 2017
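
    At the utterance level, the correlation claim above is a direct computation once metric scores and human scores exist for the same responses. The sketch below shows that check with a hand-rolled Pearson correlation; the scores are invented for illustration, and ADEM itself (a learned model) is not reproduced here.

```python
# Illustrative sketch: correlate an automatic metric's scores with human
# ratings at the utterance level. All numbers are made up.
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

metric_scores = [0.62, 0.10, 0.85, 0.40, 0.55]  # hypothetical metric outputs
human_scores = [4.0, 1.5, 4.5, 2.0, 3.0]        # hypothetical 1-5 human ratings

print(f"utterance-level Pearson r = {pearson(metric_scores, human_scores):.3f}")
```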

    Human Errors in Decision Making

    The aim of this paper was to identify human errors in the decision-making process. The study focused on the research question: what human errors could be potential causes of decision failure when evaluating the alternatives in the decision-making process? Two case studies were selected from the literature and analyzed to find the human errors that contribute to decision failure. The analysis of human errors was then linked with mental models in the evaluation-of-alternatives step. The results of the study showed that five human errors occur in the evaluation-of-alternatives step: ignorance or neglect, overconfidence, underestimation, moral error, and failure to see, which led to the non-achievement of objectives. Keywords: decision-making process, human errors, mental models, decision failure

    How to Evaluate your Question Answering System Every Day and Still Get Real Work Done

    In this paper, we report on Qaviar, an experimental automated evaluation system for question answering applications. The goal of our research was to find an automatically calculated measure that correlates well with human judges' assessment of answer correctness in the context of question answering tasks. Qaviar judges the response by computing recall against the stemmed content words in the human-generated answer key. It counts the answer correct if it exceeds a given recall threshold. We determined that the answer correctness predicted by Qaviar agreed with the human assessments 93% to 95% of the time. Forty-one question-answering systems were ranked by both Qaviar and human assessors, and these rankings correlated with a Kendall's Tau measure of 0.920, compared to a correlation of 0.956 between human assessors on the same data. Comment: 6 pages, 3 figures, to appear in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)
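
    The judgment rule described above (recall of the answer key's stemmed content words, accepted if it meets a threshold) is easy to sketch. The suffix-stripping stemmer, stopword list, and threshold below are crude placeholders and assumptions, not Qaviar's actual components.

```python
# Illustrative sketch: thresholded recall of stemmed content words.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "to", "and"}

def stem(word: str) -> str:
    # Placeholder stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def content_stems(text: str) -> set[str]:
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    return {stem(t) for t in tokens if t and t not in STOPWORDS}

def judge(response: str, answer_key: str, threshold: float = 0.5) -> bool:
    # Correct if the fraction of key content stems found in the response
    # meets the threshold.
    key_stems = content_stems(answer_key)
    if not key_stems:
        return False
    recall = len(key_stems & content_stems(response)) / len(key_stems)
    return recall >= threshold

print(judge("The Titanic sank in 1912 in the North Atlantic.", "Titanic sank 1912"))  # True
```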