    Boosting intelligence analysts’ judgment accuracy: what works, what fails?

    A routine part of intelligence analysis is judging the probability of alternative hypotheses given available evidence. Intelligence organizations advise analysts to use intelligence-tradecraft methods such as Analysis of Competing Hypotheses (ACH) to improve judgment, but such methods have not been rigorously tested. We compared the evidence evaluation and judgment accuracy of a group of intelligence analysts who were recently trained in ACH and then used it on a probability judgment task with those of another group of analysts from the same cohort who were neither trained in ACH nor asked to use any specific method. Although the ACH group assessed information usefulness better than the control group, the control group was a little more accurate (and coherent) than the ACH group. Both groups, however, exhibited suboptimal judgment and were susceptible to unpacking effects. Although ACH failed to improve accuracy, we found that recalibration and aggregation methods substantially improved accuracy. Specifically, mean absolute error (MAE) in analysts’ probability judgments decreased by 61% after first coherentizing their judgments (a process that ensures judgments respect the unitarity axiom) and then aggregating their judgments. The findings cast doubt on the efficacy of ACH, and show the promise of statistical methods for boosting judgment quality in intelligence and other organizations that routinely produce expert judgments.
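
    The coherentize-then-aggregate idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes proportional normalization as the coherentizing step and an unweighted mean as the aggregation step, and the panel of judgments and the "true" probabilities are made-up values.

```python
from statistics import mean

def coherentize(judgments):
    """Rescale one analyst's probabilities so they sum to 1.

    Assumption: proportional normalization stands in for whatever
    coherentizing procedure the paper actually used.
    """
    total = sum(judgments)
    if total <= 0:
        raise ValueError("judgments must have a positive sum")
    return [p / total for p in judgments]

def aggregate(panel):
    """Unweighted mean across analysts, hypothesis by hypothesis."""
    return [mean(column) for column in zip(*panel)]

def mean_absolute_error(estimate, truth):
    """Average absolute gap between estimated and true probabilities."""
    return mean(abs(e - t) for e, t in zip(estimate, truth))

# Hypothetical data: three analysts judging three exhaustive hypotheses.
truth = [0.5, 0.3, 0.2]
panel = [
    [0.6, 0.5, 0.4],  # sums to 1.5 (violates unitarity)
    [0.5, 0.2, 0.1],  # sums to 0.8
    [0.8, 0.4, 0.4],  # sums to 1.6
]

# Baseline: average error of the raw individual judgments.
raw_mae = mean(mean_absolute_error(analyst, truth) for analyst in panel)

# Coherentize each analyst first, then aggregate across analysts.
combined = aggregate([coherentize(analyst) for analyst in panel])
combined_mae = mean_absolute_error(combined, truth)

print(f"raw MAE: {raw_mae:.4f}, coherentized and aggregated MAE: {combined_mae:.4f}")
```

    On this toy panel the combined estimate sums to 1 and has a lower error than the raw judgments; the size of the improvement depends entirely on the made-up inputs and says nothing about the 61% figure reported above.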

    Validating a forced‑choice method for eliciting quality‑of‑reasoning judgments

    In this paper we investigate the criterion validity of forced-choice comparisons of the quality of written arguments with normative solutions. Across two studies, novices and experts assessing quality of reasoning through a forced-choice design were both able to choose arguments supporting more accurate solutions—62.2% (SE = 1%) of the time for novices and 74.4% (SE = 1%) for experts—and arguments produced by larger teams—up to 82% of the time for novices and 85% for experts—with high inter-rater reliability, namely 70.58% (95% CI = 1.18) agreement for novices and 80.98% (95% CI = 2.26) for experts. We also explored two methods for increasing efficiency. We found that the number of comparative judgments needed could be substantially reduced with little accuracy loss by leveraging transitivity and producing quality-of-reasoning assessments using an AVL tree method. Moreover, a regression model trained to predict scores based on automatically derived linguistic features of participants’ judgments achieved a high correlation with the objective accuracy scores of the arguments in our dataset. Despite the inherent subjectivity involved in evaluating differing quality of reasoning, the forced-choice paradigm allows even novice raters to perform beyond chance and can provide a valid, reliable, and efficient method for producing quality-of-reasoning assessments at scale
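
    The efficiency gain from transitivity can be illustrated with a small sketch. It assumes a sorted list with binary insertion as a simpler stand-in for the AVL tree method named in the abstract, and a hypothetical `prefer` function plays the role of the human rater; the item names and quality scores are invented.

```python
def rank_by_forced_choice(items, prefer):
    """Rank items from worst to best using pairwise forced choices.

    prefer(a, b) returns True when a is judged better than b.
    Each new item is placed by binary search, so transitivity is
    assumed: an item better than the midpoint is taken to be better
    than everything below the midpoint without asking again.
    """
    ranked = []
    comparisons = 0
    for item in items:
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if prefer(item, ranked[mid]):
                lo = mid + 1  # item belongs above the midpoint
            else:
                hi = mid      # item belongs at or below the midpoint
        ranked.insert(lo, item)
    return ranked, comparisons

# Hypothetical latent quality scores standing in for rater judgments.
quality = {"A": 0.2, "B": 0.9, "C": 0.5, "D": 0.7,
           "E": 0.1, "F": 0.4, "G": 0.8, "H": 0.3}

ranked, used = rank_by_forced_choice(
    list(quality), lambda a, b: quality[a] > quality[b]
)
all_pairs = len(quality) * (len(quality) - 1) // 2

print(ranked)
print(f"{used} comparisons instead of {all_pairs} exhaustive pairs")
```

    A balanced tree such as an AVL tree gives the same logarithmic number of comparisons per inserted item while avoiding the cost of shifting list elements. With noisy human raters the transitivity assumption holds only approximately, which is presumably why the abstract reports a small accuracy loss.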

    Multi-criteria analysis of measures in benchmarking: Dependability benchmarking as a case study

    This is the author’s version of a work that was accepted for publication in The Journal of Systems and Software. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Multi-criteria analysis of measures in benchmarking: Dependability benchmarking as a case study. Journal of Systems and Software, 111, 2016. DOI 10.1016/j.jss.2015.08.052.
    Benchmarks enable the comparison of computer-based systems attending to a variable set of criteria, such as dependability, security, performance, cost and/or power consumption. It is not despite its difficulty, but rather its mathematical accuracy that multi-criteria analysis of results remains today a subjective process rarely addressed in an explicit way in existing benchmarks. It is thus not surprising that industrial benchmarks rely only on a reduced set of easy-to-understand measures, especially when considering complex systems. This keeps the process of result interpretation straightforward, unambiguous and accurate. However, it also limits the richness and depth of the analysis process. As a result, academia prefers to characterize complex systems with a wider set of measures. Marrying the requirements of industry and academia in a single proposal remains a challenge today. This paper addresses this question by reducing the uncertainty of the analysis process using quality (score-based) models. At measure definition time, these models make explicit (i) which requirements are imposed on each type of measure, which may vary from one context of use to another, and (ii) the type and intensity of the relation between the measures considered. At measure analysis time, they provide a consistent, straightforward and unambiguous method to interpret resulting measures. The methodology and its practical use are illustrated through three different case studies from the dependability benchmarking domain, a domain where various criteria, including both performance and dependability, are typically considered during analysis of benchmark results. Although the proposed approach is limited to dependability benchmarks in this document, its usefulness for any type of benchmark seems quite evident given the general formulation of the provided solution.
    © 2015 Elsevier Inc. All rights reserved.
    This work is partially supported by the Spanish project ARENES (TIN2012-38308-C02-01), ANR French project AMORES (ANR-11-INSE-010), the Intel Doctoral Student Honour Programme 2012, and the "Programa de Ayudas de Investigacion y Desarrollo" (PAID) from the Universitat Politecnica de Valencia.
    Friginal López, J.; Martínez, M.; De Andrés, D.; Ruiz, J. (2016). Multi-criteria analysis of measures in benchmarking: Dependability benchmarking as a case study. Journal of Systems and Software. 111:105-118. https://doi.org/10.1016/j.jss.2015.08.052

    Of disasters and dragon kings: a statistical analysis of nuclear power incidents and accidents

    We perform a statistical study of risk in nuclear energy systems. This study provides and analyzes a data set that is twice the size of the previous best data set on nuclear incidents and accidents, comparing three measures of severity: the industry standard International Nuclear Event Scale, the Nuclear Accident Magnitude Scale of radiation release, and cost in U.S. dollars. The rate of nuclear accidents with cost above 20 MM 2013 USD, per reactor per year, has decreased from the 1970s until the present time. Along the way, the rate dropped significantly after Chernobyl (April 1986) and is expected to be roughly stable around a level of 0.003, suggesting an average of just over one event per year across the current global fleet. The distribution of costs appears to have changed following the Three Mile Island major accident (March 1979). The median cost became approximately 3.5 times smaller, but an extremely heavy tail emerged, being well described by a Pareto distribution with parameter α = 0.5–0.6. For instance, the cost of the two largest events, Chernobyl and Fukushima (March 2011), is equal to nearly five times the sum of the 173 other events. We also document a significant runaway disaster regime in both radiation release and cost data, which we associate with the “dragon-king” phenomenon. Since the major accident at Fukushima (March 2011) occurred recently, we are unable to quantify an impact of the industry response to this disaster. Excluding such improvements, in terms of costs, our range of models suggests that there is presently a 50% chance that (i) a Fukushima event (or larger) occurs every 60–150 years, and (ii) that a Three Mile Island event (or larger) occurs every 10–20 years. Further—even assuming that it is no longer possible to suffer an event more costly than Chernobyl or Fukushima—the expected annual cost and its standard error bracket the cost of a new plant. 
This highlights the importance of improvements not only immediately following Fukushima, but also deeper improvements to effectively exclude the possibility of “dragon-king” disasters. Finally, we find that the International Nuclear Event Scale (INES) is inconsistent in terms of both cost and radiation released. To be consistent with cost data, the Chernobyl and Fukushima disasters would need an INES level between 10 and 11, rather than the maximum of 7.
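
    To give a feel for how heavy a Pareto tail with α = 0.5–0.6 is, the sketch below evaluates the standard Pareto survival function P(X > x) = (x_min / x)^α. The scale `x_min` is a hypothetical unit value, not a figure from the paper.

```python
def pareto_survival(x, x_min, alpha):
    """P(X > x) for a Pareto distribution with scale x_min and shape alpha."""
    if x < x_min:
        return 1.0
    return (x_min / x) ** alpha

x_min = 1.0  # hypothetical scale: costs are expressed in multiples of x_min

for alpha in (0.5, 0.6):
    # Chance that an event in the tail costs more than 100 times x_min.
    tail = pareto_survival(100 * x_min, x_min, alpha)
    print(f"alpha={alpha}: P(cost > 100 * x_min) = {tail:.3f}")
```

    For any shape parameter α ≤ 1 the theoretical mean of a Pareto distribution is infinite, so total cost is dominated by the few largest events. That is consistent with the abstract's observation that the two largest accidents outweigh all the others combined.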

    Loss and Damage in the Rapidly Changing Arctic

    Arctic climate change is happening much faster than the global average. Arctic change also has global consequences, in addition to local ones. Scientific evidence shows that meltwater from Arctic sources contributes significantly to sea-level rise, accounting for 35% of current global sea-level rise. Arctic communities have to find ways to deal with rapidly changing environmental conditions that are leading to social impacts such as outmigration, much as in the global South. International debates on Loss and Damage have not addressed the Arctic so far. We review the literature to show which impacts of climate change are already visible in the Arctic, and present local cases to provide empirical evidence of losses and damages in the Arctic region. This evidence is particularly well presented in the context of outmigration and relocation, of which we highlight examples. The review reveals a need for new governance mechanisms and institutional frameworks to tackle Loss and Damage. Finally, we discuss the implications of Arctic losses and damages for the international debate.

    Coherence of probability judgments from uncertain evidence: Does ACH help?

    Although the Analysis of Competing Hypotheses method (ACH) is a structured analytic technique promoted in several intelligence communities for improving the quality of probabilistic hypothesis testing, it has received little empirical testing. Whereas previous evaluations have used numerical evidence assumed to be perfectly credible, in the present experiment we tested the effectiveness of ACH using a judgment task that presented participants with uncertain evidence varying in source reliability and information credibility. Participants (N = 227) assigned probabilities to two alternative hypotheses across six cases that systematically varied case features. Across multiple tests of coherence, the ACH group showed no advantage over a no-technique control group. Both groups showed evidence of subadditivity, unreliability (which was significantly worse in the ACH group), and overly conservative non-Bayesian judgments. The ACH group also showed pseudo-diagnostic weighting of evidence. The findings do not support the claim that ACH is effective at improving probabilistic judgment.