    Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing

    Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. Using examples from NLG, we propose a classification system for evaluations based on disentangling (i) what is being evaluated (which aspect of quality), and (ii) how it is evaluated, in terms of specific (a) evaluation modes and (b) experimental designs. We show that this approach provides a basis for determining comparability, hence for comparison of evaluations across papers, meta-evaluation experiments, and reproducibility testing.
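
    A minimal sketch of how such a classification might be encoded, assuming hypothetical field names and example values that are illustrative rather than the authors' actual taxonomy:

    from dataclasses import dataclass

    # Hypothetical encoding of the three-way distinction described above:
    # what is evaluated (quality criterion) and how it is evaluated
    # (evaluation mode, experimental design). Field names and example
    # values are assumptions for illustration, not the paper's labels.
    @dataclass(frozen=True)
    class HumanEvaluation:
        quality_criterion: str    # e.g. "fluency", "adequacy"
        evaluation_mode: str      # e.g. "absolute rating", "pairwise preference"
        experimental_design: str  # e.g. "7-point scale, 3 raters per item"

    def comparable(a: HumanEvaluation, b: HumanEvaluation) -> bool:
        """Treat two evaluations as candidates for direct comparison or
        reproducibility testing only if they assess the same quality
        criterion in the same mode (a simplification of the proposed scheme)."""
        return (a.quality_criterion == b.quality_criterion
                and a.evaluation_mode == b.evaluation_mode)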

    Multilingual Surface Realization Using Universal Dependency Trees

    We propose a shared task on multilingual Surface Realization, i.e., on mapping unordered and uninflected universal dependency trees to correctly ordered and inflected sentences in a number of languages. A second, deeper input will be available in which, in addition, functional words, fine-grained PoS and morphological information will be removed from the input trees. The first shared task on Surface Realization was carried out in 2011 with a similar setup, with a focus on English. We think that it is time for relaunching such a shared task effort in view of the arrival of Universal Dependencies annotated treebanks for a large number of languages on the one hand, and the increasing dominance of Deep Learning, which proved to be a game changer for NLP, on the other hand.

    Quantified reproducibility assessment of NLP results

    This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 different system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but also of different, original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and as a result, allows conclusions to be drawn about what aspects of system and/or evaluation design need to be changed in order to improve reproducibility.
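
    The degree-of-reproducibility score in QRA is based on the coefficient of variation from metrology. The snippet below is a simplified sketch of that idea, assuming a plain percentage coefficient of variation; the published method may include corrections and conventions not shown here.

    import statistics

    def coefficient_of_variation(scores: list[float]) -> float:
        """Percentage coefficient of variation over an original score and its
        reproductions: sample standard deviation divided by the mean, times 100.
        Lower values indicate a higher degree of reproducibility. Simplified
        sketch only; QRA as published may add corrections not shown here."""
        return 100.0 * statistics.stdev(scores) / statistics.mean(scores)

    # Illustrative usage with placeholder scores (not results from the paper):
    print(coefficient_of_variation([27.1, 26.8, 27.5]))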

    The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results

    This paper presents results from the Third Shared Task on Multilingual Surface Realisation (SR’20), which was organised as part of the COLING’20 Workshop on Multilingual Surface Realisation. As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track, where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track, where additionally functional words and morphological information were removed. Moreover, each track had two subtracks: (a) restricted-resource, where only the data provided or approved as part of a track could be used for training models, and (b) open-resource, where any data could be used. The Shallow Track was offered in 11 languages, the Deep Track in 3. Systems were evaluated using both automatic metrics and direct assessment by human evaluators in terms of Readability and Meaning Similarity to reference outputs. We present the evaluation results, along with descriptions of the SR’20 tracks, data and evaluation methods, as well as brief summaries of the participating systems. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

    Comparing Creativity, User-experience and Communicability Linked to Digital Tools during the Fuzzy Phases of Innovation

    Innovation is defined by a range of activities that have different goals but are driven by the same purpose. For example, in the final phases the aim is to put forward precise concepts, while in the upstream phases the activities are driven by the will to investigate the subject and broaden the space of knowledge and concepts useful for the design of new products. This study focuses on the latter phases because they are the ones where tools are the most variable and, de facto, the least standardised. Our aim was to study the user experience arising from the use of these tools as well as their impact on creativity and the communicability of ideas. To do this, we conducted an experimental study with 79 participants comparing four tools: pen & paper, Virtual Reality (VR) drawing, VR CAD, and traditional CAD. Using the UEQ (Laugwitz et al., 2008) and the judges’ method of Cropley and Cropley (2008), we measured user experience and creativity. We then compared the levels of creativity, user experience and communicability induced by each tool. The results reveal that the user experience arising from the tool influences the quantity and quality of the ideas. Moreover, we show that the less standardised a tool’s interactions are, the greater the communicability of the ideas.

    Underspecified Universal Dependency Structures as Inputs for Multilingual Surface Realisation

    In this paper, we present the datasets used in the Shallow and Deep Tracks of the First Multilingual Surface Realisation Shared Task (SR’18). For the Shallow Track, data in ten languages has been released: Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian and Spanish. For the Deep Track, data in three languages is made available: English, French and Spanish. We describe in detail how the datasets were derived from the Universal Dependencies V2.0, and report on an evaluation of the Deep Track input quality. In addition, we examine the motivation for, and likely usefulness of, deriving NLG inputs from annotations in resources originally developed for Natural Language Understanding (NLU), and assess whether the resulting inputs supply enough information of the right kind for the final stage in the NLG process.
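
    A rough sketch of the kind of transformation the Shallow Track inputs involve (lemmatised tokens, word order information removed), written against the standard CoNLL-U column layout; the actual SR’18 data preparation involves additional steps and format conventions not shown here.

    import random

    def shallow_input(conllu_sentence: str, seed: int = 0) -> list[dict]:
        """Rough sketch of deriving a Shallow-Track-style input from one
        CoNLL-U sentence: keep lemma, UPOS, morphological features, head and
        deprel, drop the surface form, and discard linear order by shuffling.
        Illustrative only; not the actual SR'18 preparation pipeline."""
        tokens = []
        for line in conllu_sentence.splitlines():
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if not cols[0].isdigit():  # skip multiword-token and empty-node rows
                continue
            tokens.append({
                "id": int(cols[0]),
                "lemma": cols[2],
                "upos": cols[3],
                "feats": cols[5],
                "head": int(cols[6]),
                "deprel": cols[7],
            })
        random.Random(seed).shuffle(tokens)  # remove word order information
        return tokens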

    A pipeline for extracting abstract dependency templates for data-to-text natural language generation

    We present work in progress that aims to address the coverage issue faced by rule-based text generators. We propose a pipeline for extracting abstract dependency templates (predicate-argument structures) from Wikipedia text to be used as input for generating text from structured data with the FORGe system. The pipeline comprises three main components: (i) candidate sentence retrieval, (ii) clause extraction, ranking and selection, and (iii) conversion to predicate-argument form. We present an approach and preliminary evaluation for the ranking and selection module.
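
    A schematic sketch of the three-component pipeline, with toy stand-ins for each stage; the function bodies below are illustrative placeholders, not the components actually used with FORGe.

    def retrieve_candidate_sentences(corpus: list[str], keyword: str) -> list[str]:
        """(i) Candidate sentence retrieval: keep sentences mentioning a keyword
        (a toy stand-in for the actual retrieval component)."""
        return [s for s in corpus if keyword.lower() in s.lower()]

    def extract_rank_select_clauses(sentences: list[str], top_k: int = 3) -> list[str]:
        """(ii) Clause extraction, ranking and selection: here clauses are naively
        split on ';' and shorter clauses are preferred, as a placeholder ranking."""
        clauses = [c.strip() for s in sentences for c in s.split(";") if c.strip()]
        return sorted(clauses, key=len)[:top_k]

    def to_predicate_argument_template(clause: str, arguments: list[str]) -> str:
        """(iii) Conversion to predicate-argument form: replace argument strings
        with numbered slots to obtain an abstract template."""
        for i, arg in enumerate(arguments, start=1):
            clause = clause.replace(arg, f"ARG{i}")
        return clause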

    Another PASS: a reproduction study of the human evaluation of a football report generation system

    This paper reports results from a reproduction study in which we repeated the human evaluation of the PASS Dutch-language football report generation system (van der Lee et al., 2017). The work was carried out as part of the ReproGen Shared Task on Reproducibility of Human Evaluations in NLG, in Track A (Paper 1). We aimed to repeat the original study exactly, with the main difference that a different set of evaluators was used. We describe the study design, present the results from the original and the reproduction study, and then compare and analyse the differences between the two sets of results. For the two ‘headline’ results of average Fluency and Clarity, we find that in both studies the system was rated more highly for Clarity than for Fluency, and that Clarity had the higher standard deviation. Clarity and Fluency ratings were higher, and their standard deviations lower, in the reproduction study than in the original study by substantial margins. Clarity had a higher degree of reproducibility than Fluency, as measured by the coefficient of variation. Data and code are publicly available.
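
    A small sketch of the kind of study-level comparison reported here, assuming two lists of ratings for one criterion (e.g. Clarity) from the original and the reproduction study; the summary statistics and the coefficient of variation over the two study means are a simplified stand-in for the paper's actual analysis.

    import statistics

    def compare_criterion(original: list[float], reproduction: list[float]) -> dict:
        """Summarise one criterion (e.g. Fluency or Clarity) across an original
        study and its reproduction: per-study mean and standard deviation, plus
        the coefficient of variation over the two study-level means as a simple
        reproducibility indicator (a sketch, not the paper's exact procedure)."""
        means = [statistics.mean(original), statistics.mean(reproduction)]
        return {
            "mean_original": means[0],
            "mean_reproduction": means[1],
            "sd_original": statistics.stdev(original),
            "sd_reproduction": statistics.stdev(reproduction),
            "cv_of_means_percent": 100.0 * statistics.stdev(means) / statistics.mean(means),
        }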