Rethinking the Agreement in Human Evaluation Tasks
Human evaluations are broadly thought to be more valuable the higher the inter-annotator agreement. In this paper we examine this idea. We describe our experiments and analysis within the area of Automatic Question Generation. Our experiments show how annotators diverge in language annotation tasks due to a range of ineliminable factors. For this reason, we believe that annotation schemes for natural language generation tasks that are aimed at evaluating language quality need to be treated with great care. In particular, an unchecked focus on reduction of disagreement among annotators runs the danger of creating generation goals that reward output that is more distant from, rather than closer to, natural human-like language. We conclude the paper by suggesting a new approach to the use of agreement metrics in natural language generation evaluation tasks.
What does validation of cases in electronic record databases mean? The potential contribution of free text
Electronic health records are increasingly used for research. The definition of cases or endpoints often relies on the use of coded diagnostic data, using a pre-selected group of codes. Validation of these cases, as ‘true’ cases of the disease, is crucial. There are, however, ambiguities in what is meant by validation in the context of electronic records. Validation usually implies comparison of a definition against a gold standard of diagnosis and the ability to identify false negatives (‘true’ cases which were not detected) as well as false positives (detected cases which did not have the condition). We argue that two separate concepts of validation are often conflated in existing studies: firstly, whether the GP thought the patient was suffering from a particular condition (which we term confirmation or internal validation), and secondly, whether the patient really had the condition (external validation). Few studies have the ability to detect false negatives who have not received a diagnostic code. Natural language processing is likely to open up the use of free text within the electronic record, which will facilitate both the validation of the coded diagnosis and searching for false negatives.
Comparing automatically detected reflective texts with human judgements
This paper reports on the descriptive results of an experiment comparing automatically detected reflective and not-reflective texts against human judgements. Based on the theory of reflective writing assessment and its operationalisation, five elements of reflection were defined. For each element of reflection, a set of indicators was developed, which automatically annotate texts regarding reflection based on the parameterisation with authoritative texts. Using a large blog corpus, 149 texts were retrieved, which were annotated as either reflective or not-reflective. An online survey was then used to gather human judgements for these texts. These two data sets were used to compare the quality of the reflection detection algorithm with human judgements. The analysis indicates the expected difference between reflective and not-reflective texts.