In this paper we present research in which we apply (i) the kind of intrinsic evaluation metrics that are characteristic of current comparative HLT evaluation, and (ii) extrinsic, human task-performance evaluations more in keeping with NLG traditions, to 15 systems implementing a language generation task. We analyse the evaluation results and find that there are no significant correlations between intrinsic and extrinsic evaluation measures for this task.

46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies

Belz, Anja

Gatt, Albert

OAR@UM

Intrinsic vs. Extrinsic Evaluation Measures for Referring Expression Generation

Anja Belz
Natural Language Technology Group
University of Brighton
Brighton BN2 4GJ, UK
a.s.belz@brighton.ac.uk

Albert Gatt
Department of Computing Science
University of Aberdeen
Aberdeen AB24 3UE, UK
a.gatt@abdn.ac.uk

Abstract

In this paper we present research in which we apply (i) the kind of intrinsic evaluation metrics that are characteristic of current comparative HLT evaluation, and (ii) extrinsic, human task-performance evaluations more in keeping with NLG traditions, to 15 systems implementing a language generation task. We analyse the evaluation results and find that there are no significant correlations between intrinsic and extrinsic evaluation measures for this task.

1 Introduction

In recent years, NLG evaluation has taken on a more comparative character. NLG now has comparative evaluation results for comparable, but independently developed systems, including results for systems that regenerate the Penn Treebank (Langkilde, 2002) and systems that generate weather forecasts (Belz and Reiter, 2006). The growing interest in comparative evaluation has also resulted in a tentative interest in shared-task evaluation events, which led to the first such event for NLG (the Attribute Selection for Generation of Referring Expressions, or ASGRE, Challenge) in 2007 (Belz and Gatt, 2007), with a second event (the Referring Expression Generation, or REG, Challenge) currently underway.

In HLT in general, comparative evaluations (and shared-task evaluation events in particular) are dominated by intrinsic evaluation methodologies, in contrast to the more extrinsic evaluation traditions of NLG. In this paper, we present research in which we applied both intrinsic and extrinsic evaluation methods to the same task, in order to shed light on how the two correlate for NLG tasks.
The results show a surprising lack of correlation between the two types of measures, suggesting that intrinsic metrics and extrinsic methods may represent two very different views of how well a system performs.

2 Task, Data and Systems

Referring expression generation (REG) is concerned with the generation of expressions that describe entities in a given piece of discourse. REG research goes back at least to the 1980s (Appelt, Grosz, Joshi, McDonald and others), but the field as it is today was shaped in particular by Dale and Reiter’s work (Dale, 1989; Dale and Reiter, 1995). REG tends to be divided into the stages of attribute selection (selecting properties of entities) and realisation (converting selected properties into word strings). Attribute selection in its standard formulation was the shared task in the ASGRE Challenge: given an intended referent (‘target’) and the other domain entities (‘distractors’), each with possible attributes, select a set of attributes for the target referent.

The ASGRE data (which is now publicly available) consists of all 780 singular items in the TUNA corpus (Gatt et al., 2007) in two subdomains, consisting of descriptions of furniture and people. Each data item is a paired attribute set (as derived from a human-produced RE) and domain representation (target and distractor entities represented as possible attributes and values).

ASGRE participants were asked to submit the outputs produced by their systems for an unseen test data set. The outputs from 15 of these systems, shown in the left column of Table 1, were used in the experiments reported below. Systems differed in terms of whether they were trainable, performed exhaustive search and hardwired the use of certain attribute types, among other algorithmic properties (see the ASGRE papers for full details).
In the case of one system, IS-FBS, a buggy version originally submitted and used for Exp 1 was replaced in Exp 2 by a corrected version; the former is marked by a * in what follows.

3 Evaluation Methods

1. Extrinsic evaluation measures: We conducted two task-performance evaluation experiments (the first was part of the ASGRE Challenge, the second is new), in which participants identified the referent denoted by a description by clicking on a picture in a visual display of target and distractor entities. To enable subjects to read the outputs of peer systems, we converted them from the attribute-value format described above to something more readable, using a simple attribute-to-word converter.

Both experiments used a Repeated Latin Squares design, and involved 30 participants and 2,250 individual trials (see Belz and Gatt (2007) for full details).

In Exp 1, subjects were shown the domain on the same screen as the description. Two dependent measures were used: (i) combined reading and identification time (RIT), measured from the point at which the description and pictures appeared on the screen to the point at which a picture was selected by mouse-click; and (ii) error rate (ER-1).

In Exp 2, subjects first read the description and then initiated the presentation of domain entities. Three measures were computed: (i) reading time (RT), measured from the presentation of a description to the point where a subject requested the presentation of the domain; (ii) identification time (IT), measured from the presentation of the domain to the point where a subject clicked on a picture; and (iii) error rate (ER-2).

2. REG-specific intrinsic measures: Uniqueness is the proportion of attribute sets generated by a system which identify the referent uniquely (i.e. the attribute set applies to none of the distractors). Minimality is the proportion of attribute sets which are minimal as well as unique (i.e. there is no smaller unique set of attributes).
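The two checks just defined can be sketched in a few lines of Python. This is a minimal illustration and not the evaluation code used in the study: the dictionary representation of entities, the function names and the toy furniture domain are all assumptions made for the example, and minimality is tested by brute force over every smaller subset of the target's attributes.

```python
from itertools import combinations


def matches(entity, attrs):
    """True if every (attribute, value) pair in attrs holds of the entity."""
    return all(entity.get(name) == value for name, value in attrs)


def is_unique(attrs, target, distractors):
    """The attribute set fits the target and none of the distractors."""
    return matches(target, attrs) and not any(
        matches(d, attrs) for d in distractors
    )


def is_minimal(attrs, target, distractors):
    """Unique, and no smaller set of the target's attributes is unique."""
    if not is_unique(attrs, target, distractors):
        return False
    candidates = sorted(target.items())
    for size in range(len(attrs)):  # all strictly smaller set sizes
        for subset in combinations(candidates, size):
            if is_unique(subset, target, distractors):
                return False
    return True


def proportion(flags):
    """Share of True values, e.g. uniqueness over all of a system's outputs."""
    flags = list(flags)
    return sum(flags) / len(flags)


# Hypothetical toy domain (not taken from the TUNA corpus).
target = {"type": "chair", "colour": "red", "size": "large"}
distractors = [
    {"type": "chair", "colour": "blue", "size": "large"},
    {"type": "sofa", "colour": "red", "size": "small"},
]

short = {("type", "chair"), ("colour", "red")}
long_ = {("type", "chair"), ("colour", "red"), ("size", "large")}
ambiguous = {("colour", "red")}

outputs = [short, long_, ambiguous]
uniqueness = proportion(is_unique(a, target, distractors) for a in outputs)
minimality = proportion(is_minimal(a, target, distractors) for a in outputs)
```

In this toy domain the two-attribute set is both unique and minimal, the three-attribute set is unique but not minimal (colour plus size already suffices), and colour alone is not unique because the sofa is also red. Referent type is handled like any other attribute here.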
These measures were included because they are commonly named as desiderata for attribute selection algorithms in the REG field (Dale, 1989). The minimality check used in this paper treats referent type as a simple attribute, as the ASGRE systems tended to do.[1]

3. Set-similarity measures: The Dice similarity coefficient computes the similarity between a peer attribute set $A_1$ and a (human-produced) reference attribute set $A_2$ as

$$\mathrm{Dice}(A_1, A_2) = \frac{2 \times |A_1 \cap A_2|}{|A_1| + |A_2|}$$

MASI (Passonneau, 2006) is similar but biased in favour of similarity where one set is a subset of the other.

4. String-similarity measures: In order to apply string-similarity metrics, peer and reference outputs were converted to word-strings by the method described under 1 above. String-edit distance (SE) is straightforward Levenshtein distance with a substitution cost of 2 and insertion/deletion cost of 1. We also used the version of string-edit distance (‘SEB’) of Bangalore et al. (2000) which normalises for length. BLEU is a precision metric from MT that assesses the quality of a peer translation in terms of the proportion of its word n-grams (n ≤ 4 is standard) that it shares with several reference translations. The NIST MT evaluation metric (Doddington, 2002) is an adaptation of BLEU which gives more importance to less frequent (hence more informative) n-grams. We also used two versions of the ROUGE metric (Lin and Hovy, 2003), ROUGE-2 and ROUGE-SU4 (based on non-contiguous, or ‘skip’, n-grams), which were official scores in the DUC 2005 summarization task.

4 Results

Results for all evaluation measures and all systems are shown in Table 1. Uniqueness results are not included, as all systems scored 100%.

We ran univariate analyses of variance (ANOVAs) using SYSTEM as the independent variable (15 levels), testing its effect on the extrinsic task-performance measures. For error rate (ER), we used a Kruskal-Wallis ranks test to compare identification accuracy rates across systems.[2]
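For readers less familiar with the two tests named here, the statistics themselves can be sketched as follows. This is a simplified illustration under stated assumptions: it treats the data as a plain one-way between-groups layout with one list of observations per system, ignores the Repeated Latin Squares design of the actual experiments, and stops at the test statistic without computing a p-value. The function names and the numbers in the usage lines are invented for the example.

```python
from collections import Counter


def anova_f(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H on pooled ranks, with average ranks and tie correction."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)

    # Assign each distinct value the mean of the 1-based ranks it occupies.
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2
        i = j

    rank_term = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Standard correction for ties: divide by 1 - sum(t^3 - t) / (n^3 - n).
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


# Invented toy data: three "systems" with three observations each.
f_stat = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
h_stat = kruskal_wallis_h([[0, 0, 1], [0, 1, 1]])
```

The tie correction matters for data like the error rates discussed here, where many observations share the value zero; without it, H would be understated.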
System        RIT       RT       IT   ER-1   ER-2    Min  R-SU4    R-2   NIST   BLEU     SE    SEB   Dice   MASI
CAM-B     2784.80  1309.07  1952.39   9.33   5.33   8.11   .673   .647   2.70   .309   4.42   .307   .620   .403
CAM-BU    2659.37  1251.32  1877.95   9.33      4  10.14   .663   .638   2.61   .317   4.23   .359   .630   .420
CAM-T     2626.02  1475.31  1978.24     10   5.33      0   .698   .723   3.50   .415   3.67   .496   .725   .560
CAM-TU    2572.82  1297.37  1809.04   8.67      4      0   .677   .691   3.28   .407   3.71   .494   .721   .557
DIT-DS    2785.40  1304.12  1859.25  10.67      2      0   .651   .679   4.23   .457   3.55   .525   .750   .595
GR-FP     2724.56  1382.04  2053.33   8.67   3.33   4.73    .65   .649   3.24   .358   3.87   .441   .689   .480
GR-SC     2811.09  1349.05  1899.59  11.33      2   4.73   .644   .644   2.42   .305      4   .431   .671   .466
IS-FBN    3570.90  1837.55  2188.92  15.33      6   1.35   .771   .772   4.75   .521   3.15   .438   .770   .601
IS-FBS          –  1461.45  2181.88      –   7.33    100   .485   .448   2.11   .166   5.53   .089   .368   .182
*IS-FBS   4008.99        –        –     10      –  39.86      –      –      –      –      –      –   .527   .281
IS-IAC    2844.17  1356.15  1973.19   8.67      6      0   .612   .623   3.77   .442   3.43   .559   .746   .597
NIL       1960.31  1482.67  1960.31     10   5.33  20.27   .525   .509   3.32    .32   4.12   .447   .625   .477
T-AS+     2652.85  1321.20  1817.30   9.33   4.67      0   .671   .684   2.62   .298   4.24    .37   .660   .452
T-AS      2864.93  1229.42  1766.35     10   4.67      0   .683   .692   2.99   .342   4.10   .393   .645   .422
T-RS+     2759.76  1278.01  1814.93   6.67   1.33      0   .677   .697   2.85   .303   4.32    .36   .669   .459
T-RS      2514.37  1255.28  1866.94   8.67   4.67      0   .694   .711   3.16   .341   4.18   .383   .655   .432

Table 1: Results for all systems and evaluation measures (ER-1 = error rate in Exp 1, ER-2 = error rate in Exp 2). (R = ROUGE; GR = GRAPH; T = TITCH). Column groups: extrinsic (RIT, RT, IT, ER-1, ER-2); REG (Min); string-similarity (R-SU4, R-2, NIST, BLEU, SE, SEB); set-similarity (Dice, MASI).

[1] As a consequence, the Minimality results we report here look different from those in the ASGRE report.

[2] A non-parametric test was more appropriate given the large number of zero values in ER proportions, and a high dependency of variance on the mean.

The main effect of SYSTEM was significant on RIT (F(14, 2249) = 6.401, p < .001), RT (F(14, 2249) = 2.56, p < .01), and IT (F(14, 2249) = 1.93, p < .01).
In neither experiment was there a significant effect on ER.

Table 2 shows correlations between the automatic evaluation measures and the task-performance measures from Exp 2. RIT and ER-1 are not included because of *IS-FBS in Exp 1 (but see individual results below). For reasons of space, we refer the reader to the table for individual correlation results.

We also computed correlations between the task-performance measures across the two experiments (leaving out the IS-FBS system). Correlation between RIT and RT was .827**; between RIT and IT .675**; and there was no significant correlation between the error rates. The one difference evident between RT and IT is that ER correlates only with IT (not RT) in Exp 2 (see Table 2).

5 Discussion

In Table 2, the four broad types of metrics we have investigated (task-performance, REG-specific, string similarity, set similarity) are indicated by vertical and horizontal lines. The results within each of the resulting boxes are very homogeneous. There are significant (and mostly strong) correlations not only among the string-similarity metrics and among the set-similarities, but also across the two types. There are also significant correlations between the three task-performance measures.

However, the correlation figures between the task-performance measures and all others are weak and not significant. The one exception is the correlation between NIST and RT, which is actually in the wrong direction (better NIST implies worse reading times). This is an unambiguous result and it shows clearly that similarity to human-produced reference texts is not necessarily indicative of quality as measured by human task performance.

The emergence of comparative evaluation in NLG raises the broader question of how systems that generate language should be compared.
In MT and summarisation it is more or less taken as read that more human-like language makes for a better system. However, it has not been shown that more human-like outputs result in better performance from an extrinsic perspective. Though higher humanlikeness might be expected to entail better task-performance (here, shorter reading/identification times, lower error), the results of our experiments suggest otherwise.

             RT      IT    ER-2     Min   R-SU4     R-2    NIST    BLEU      SE     SEB    Dice    MASI
RT            1    .8**     .46     .18     .10     .05    .54*     .39    -.30     .02     .12     .23
IT         .8**       1    .59*    .56*    -.24    -.33     .22     .04     .09    -.31    -.28    -.17
ER-2        .46    .59*       1     .51    -.29    -.36     .03    -.08     .22    -.34    -.39    -.29
Min         .18    .56*     .51       1  -.76**  -.81**    -.46  -.66**   .79**   -.8**  -.90**  -.79**
R-SU4       .10    -.24    -.29  -.76**       1   .98**     .45    .63*   -.63*     .42   .72**    .57*
R-2         .05    -.33    -.36  -.81**   .98**       1     .51   .68**  -.69**    .53*   .78**   .65**
NIST       .54*     .22     .03    -.46     .45     .51       1   .94**  -.84**   .68**   .74**   .82**
BLEU        .39     .04    -.08  -.66**    .63*   .68**   .94**       1  -.96**   .82**   .89**   .93**
SE         -.30     .09     .22   .79**   -.63*  -.69**  -.84**  -.96**       1  -.92**  -.96**  -.97**
SEB         .02    -.31    -.34   -.8**     .42    .53*   .68**   .82**  -.92**       1   .92**   .95**
Dice        .12    -.28    -.39  -.90**   .72**   .78**   .74**   .89**  -.96**   .92**       1   .97**
MASI        .23    -.17    -.29  -.79**    .57*   .65**   .82**   .93**  -.97**   .95**   .97**       1

Table 2: Pairwise correlations between all automatic measures and the task-performance results from Exp 2. (* = significant at .05; ** at .01). R = ROUGE. Measure groups, in row and column order: extrinsic (RT, IT, ER-2); REG (Min); string-similarity (R-SU4, R-2, NIST, BLEU, SE, SEB); set-similarity (Dice, MASI).

6 Conclusions

Our aim in this paper was to shed light on how the intrinsic evaluation methodologies that dominate current comparative HLT evaluations correlate with human task-performance evaluations more in keeping with NLG traditions. We used the data and systems from the recent ASGRE Challenge, and compared a total of 17 different evaluation methods for 15 different systems all implementing the ASGRE task.

Our most striking result is that none of the metrics that assess humanlikeness correlate with any of the task-performance measures, while strong correlations are observed within the two classes of measures (intrinsic and extrinsic). Somewhat worryingly, our results show that a system’s ability to produce human-like outputs may be completely unrelated to its effect on human task-performance.

Our main conclusions for REG evaluation are that we need to be cautious in relying on humanlikeness as a quality criterion, and that we leave extrinsic evaluation behind at our peril as we move towards more comparative forms of evaluation.

Given that the intrinsic metrics that dominate in competitive HLT evaluations are not assessed in terms of correlation with extrinsic notions of quality, our results sound a more general note of caution about using intrinsic measures (and humanlikeness metrics in particular) without extrinsic validation.

Acknowledgments

We gratefully acknowledge the contribution made to the evaluations by the faculty and staff at Brighton University who participated in the identification experiments. Thanks are also due to Robert Dale, Kees van Deemter and Ielka van der Sluis for very helpful comments. The biggest contribution was, of course, made by the participants in the ASGRE Challenge who created the systems involved in the evaluations.

References

S. Bangalore, O. Rambow, and S. Whittaker. 2000. Evaluation metrics for generation. In Proceedings of the 1st International Conference on Natural Language Generation (INLG ’00), pages 1–8.

A. Belz and A. Gatt. 2007. The attribute selection for GRE challenge: Overview and evaluation results. In Proceedings of the 2nd UCNLG Workshop: Language Generation and Machine Translation (UCNLG+MT), pages 75–83.

A. Belz and E. Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In Proc. EACL’06, pages 313–320.

R. Dale and E. Reiter. 1995. Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(2):233–263.

R. Dale. 1989. Cooking up referring expressions. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics.

G. Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proc. ARPA Workshop on Human Language Technology.

A. Gatt, I. van der Sluis, and K. van Deemter. 2007. Evaluating algorithms for the generation of referring expressions using a balanced corpus. In Proceedings of the 11th European Workshop on Natural Language Generation (ENLG’07), pages 49–56.

I. Langkilde. 2002. An empirical verification of coverage and correctness for a general-purpose sentence generator. In Proceedings of the 2nd International Natural Language Generation Conference (INLG ’02).

C.-Y. Lin and E. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proc. HLT-NAACL 2003, pages 71–78.

R. Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the 5th Language Resources and Evaluation Conference (LREC’06).
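As a supplementary illustration, the set-similarity and string-edit measures described in Section 3 can be sketched as follows. Several points are assumptions rather than details given in the paper: the MASI weights (1 for identical sets, 2/3 when one set contains the other, 1/3 for other overlaps, 0 for disjoint sets) follow our reading of Passonneau's definition; string-edit distance is computed over word tokens rather than characters; and two empty sets are treated as identical. The function names and example descriptions are invented.

```python
def dice(peer, reference):
    """Dice coefficient: 2 * |A1 & A2| / (|A1| + |A2|)."""
    a, b = set(peer), set(reference)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def masi(peer, reference):
    """Jaccard overlap scaled by a weight that favours subset relations."""
    a, b = set(peer), set(reference)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    if a == b:
        weight = 1.0
    elif a <= b or b <= a:
        weight = 2 / 3
    elif a & b:
        weight = 1 / 3
    else:
        weight = 0.0
    return weight * len(a & b) / len(a | b)


def edit_distance(peer, reference, sub_cost=2, indel_cost=1):
    """Levenshtein distance over token sequences with configurable costs."""
    # prev[j] holds the cost of turning the peer prefix seen so far
    # into the first j reference tokens.
    prev = [j * indel_cost for j in range(len(reference) + 1)]
    for i, p in enumerate(peer, start=1):
        curr = [i * indel_cost]
        for j, r in enumerate(reference, start=1):
            curr.append(min(
                prev[j] + indel_cost,                      # delete p
                curr[j - 1] + indel_cost,                  # insert r
                prev[j - 1] + (0 if p == r else sub_cost)  # match / substitute
            ))
        prev = curr
    return prev[-1]


# Invented example: peer vs. reference attribute sets and word strings.
peer_attrs = {"type:chair", "colour:red"}
ref_attrs = {"type:chair", "colour:red", "size:large"}
set_scores = (dice(peer_attrs, ref_attrs), masi(peer_attrs, ref_attrs))

se = edit_distance("the red chair".split(), "the large red chair".split())
```

With a substitution cost of 2 and an insertion/deletion cost of 1, replacing a word costs the same as deleting it and inserting another, so the distance effectively counts insertions and deletions only.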

Intrinsic vs. extrinsic evaluation measures for referring expression generation

https://www.um.edu.mt/library/oar//bitstream/123456789/22035/1/acl08-eval.pdf
