4 research outputs found

Measuring quality of general reasoning

Author: Barnett Ashley
Dennis Simon
Diamond Michael L.
Kruger Ariel
Marcoci Alexandru
Primoratz Tamar
Rowe Luke
Saletta Morgan
Stone Benjamin
van Gelder Tim
Webb Margaret E.
Publication date: 01/01/2022

Machine learning models that automatically assess reasoning quality are trained on human-annotated written products. These “gold-standard” corpora are typically created by prompting annotators to choose, using a forced choice design, which of two products presented side by side is more convincing, contains stronger evidence, or would be adopted by more people. Despite the growing popularity of forced choice designs for assessing quality of reasoning (QoR), no study to date has established the validity and reliability of such a method. In two studies, we simultaneously presented two products of reasoning to participants and asked them to identify which product was ‘better justified’ through a forced choice design. We investigated the criterion validity and inter-rater reliability of the forced choice protocol by assessing the relationship between QoR, measured using the forced choice protocol, and accuracy in objectively answerable problems using naive raters sampled from MTurk (Study 1) and experts (Study 2), respectively. In both studies, products that were closer to the correct answer and products generated by larger teams were consistently preferred. Experts were substantially better at picking the reasoning products that corresponded to accurate answers. Perhaps the most surprising finding was just how rapidly raters made judgements regarding reasoning: on average, both novices and experts made reliable decisions in under 15 seconds. We conclude that forced choice is a valid and reliable method of assessing QoR.

eScholarship - University of California

University of Dundee Online Publications

Validating a forced-choice method for eliciting quality-of-reasoning judgments

Author: Barnett Ashley
Dennis Simon
Diamond Michael L.
Karvetski Christopher W.
Kruger Ariel
Marcoci Alexandru
Primoratz Tamar
Rowe Luke
Saletta Morgan
Stelmach Margaret E.
Stone Benjamin
Tetlock Philip E.
van Gelder Tim
Publication venue: Springer
Publication date: 01/01/2023

In this paper we investigate the criterion validity of forced-choice comparisons of the quality of written arguments with normative solutions. Across two studies, novices and experts assessing quality of reasoning through a forced-choice design were both able to choose arguments supporting more accurate solutions, doing so 62.2% (SE = 1%) of the time for novices and 74.4% (SE = 1%) for experts. Both groups also preferred arguments produced by larger teams, up to 82% of the time for novices and 85% for experts. Inter-rater reliability was high, namely 70.58% (95% CI = 1.18) agreement for novices and 80.98% (95% CI = 2.26) for experts. We also explored two methods for increasing efficiency. We found that the number of comparative judgments needed could be substantially reduced with little accuracy loss by leveraging transitivity and producing quality-of-reasoning assessments using an AVL tree method. Moreover, a regression model trained to predict scores based on automatically derived linguistic features of participants’ judgments achieved a high correlation with the objective accuracy scores of the arguments in our dataset. Despite the inherent subjectivity involved in evaluating differing quality of reasoning, the forced-choice paradigm allows even novice raters to perform beyond chance and can provide a valid, reliable, and efficient method for producing quality-of-reasoning assessments at scale.
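
The abstract's point about leveraging transitivity can be illustrated with a rough sketch. The code below is not the authors' AVL tree method; it swaps in a simpler binary insertion into a sorted list, which rests on the same assumption (if A beats B and B beats C, A need not be compared with C). The function name `rank_by_forced_choice` and the `prefers` callback are hypothetical, and the comparator here is assumed to be perfectly consistent, which real raters are not.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rank_by_forced_choice(
    items: Sequence[T],
    prefers: Callable[[T, T], bool],
) -> Tuple[List[T], int]:
    """Rank items best-first using pairwise forced-choice judgments.

    `prefers(a, b)` is a hypothetical stand-in for one forced-choice
    judgment: it returns True when `a` is judged better than `b`.
    Each new item is placed by binary search, so transitivity is
    assumed rather than checked.
    """
    ranked: List[T] = []  # best item first
    comparisons = 0
    for item in items:
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1  # one forced-choice judgment
            if prefers(item, ranked[mid]):
                hi = mid      # item belongs above ranked[mid]
            else:
                lo = mid + 1  # item belongs below ranked[mid]
        ranked.insert(lo, item)
    return ranked, comparisons


if __name__ == "__main__":
    # Illustrative quality scores only; not data from the paper.
    scores = [3.0, 1.0, 4.0, 1.5, 9.0, 2.0, 6.0, 5.0]
    order, used = rank_by_forced_choice(scores, lambda a, b: a > b)
    all_pairs = len(scores) * (len(scores) - 1) // 2
    print(order)
    print(f"{used} judgments instead of {all_pairs} for every pair")
```

Under these assumptions the number of judgments grows roughly as n log n rather than n(n-1)/2. A self-balancing tree such as an AVL tree gives the same comparison bound while also keeping insertion cheap for large collections.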

ACU Research Bank

Comparison of RST and ODNI methods for evaluating quality of reasoning

Author: Ashley Barnett
Luke Thorburn
Simon Dennis
Tamar Primoratz
Tim van Gelder
Publication venue: Center for Open Science
Publication date: 21/05/2020

OSF Preprints

Validating a forced-choice method for eliciting quality-of-reasoning judgments

Author: Barnett Ashley
Dennis Simon
Diamond Michael L
Karvetski Christopher W
Kruger Ariel
Marcoci Alexandru
Primoratz Tamar
Rowe Luke
Saletta Morgan
Stone Benjamin
Tetlock Philip E
van Gelder Tim
Webb Margaret E
Publication venue: Behavior Research Methods
Publication date: 01/08/2024

Acknowledgements: We would like to thank Mark Burgman and three anonymous referees for helpful comments on an earlier version of this paper.

In this paper we investigate the criterion validity of forced-choice comparisons of the quality of written arguments with normative solutions. Across two studies, novices and experts assessing quality of reasoning through a forced-choice design were both able to choose arguments supporting more accurate solutions, doing so 62.2% (SE = 1%) of the time for novices and 74.4% (SE = 1%) for experts. Both groups also preferred arguments produced by larger teams, up to 82% of the time for novices and 85% for experts. Inter-rater reliability was high, namely 70.58% (95% CI = 1.18) agreement for novices and 80.98% (95% CI = 2.26) for experts. We also explored two methods for increasing efficiency. We found that the number of comparative judgments needed could be substantially reduced with little accuracy loss by leveraging transitivity and producing quality-of-reasoning assessments using an AVL tree method. Moreover, a regression model trained to predict scores based on automatically derived linguistic features of participants' judgments achieved a high correlation with the objective accuracy scores of the arguments in our dataset. Despite the inherent subjectivity involved in evaluating differing quality of reasoning, the forced-choice paradigm allows even novice raters to perform beyond chance and can provide a valid, reliable, and efficient method for producing quality-of-reasoning assessments at scale.

Apollo (Cambridge)