
    Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models

    Full text link
    Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate as directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.
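
    To make the decompose-then-validate idea concrete, here is a minimal Python sketch of the workflow the abstract describes: each node carries a natural-language description and tests, children are synthesized before their parent, and candidate implementations are kept only if they pass the node's tests. The `sample_implementations` call is a hypothetical stand-in for a code-LLM query, not part of the paper's API, and the greedy per-function search is a simplification; Parsel itself searches over combinations of candidate implementations.

```python
# Hedged sketch of Parsel-style synthesis, not the paper's actual implementation.
from dataclasses import dataclass, field

@dataclass
class FunctionSpec:
    name: str                              # function name
    description: str                       # natural-language specification
    tests: list                            # (args_tuple, expected_output) pairs
    children: list = field(default_factory=list)

def sample_implementations(spec, n=8):
    """Placeholder for a code-LLM call returning n candidate Python sources,
    each defining `spec.name`. Assumed for illustration only."""
    raise NotImplementedError

def synthesize(spec, namespace):
    """Implement children first so parent candidates can call them, then keep
    the first candidate whose implementation passes all of the spec's tests."""
    for child in spec.children:
        synthesize(child, namespace)
    for source in sample_implementations(spec):
        exec(source, namespace)            # defines spec.name in namespace
        fn = namespace[spec.name]
        if all(fn(*args) == expected for args, expected in spec.tests):
            return fn
    raise RuntimeError(f"no candidate for {spec.name} passed its tests")
```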

    Hypothesis Search: Inductive Reasoning with Language Models

    Full text link
    Inductive reasoning is a core problem-solving capacity: humans can identify underlying principles from a few examples, which can then be robustly generalized to novel scenarios. Recent work has evaluated large language models (LLMs) on inductive reasoning tasks by directly prompting them, i.e., "in-context learning." This can work well for straightforward inductive tasks, but performs very poorly on more complex tasks such as the Abstraction and Reasoning Corpus (ARC). In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction: we prompt the LLM to propose multiple abstract hypotheses about the problem in natural language, then implement the natural language hypotheses as concrete Python programs. These programs can be directly verified by running them on the observed examples and generalized to novel inputs. Because of the prohibitive cost of generation with state-of-the-art LLMs, we consider a middle step to filter the set of hypotheses that will be implemented into programs: we either ask the LLM to summarize them into a smaller set of hypotheses, or ask human annotators to select a subset of the hypotheses. We verify our pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its variant 1D-ARC, and the string transformation dataset SyGuS. On a random 40-problem subset of ARC, our automated pipeline using LLM summaries achieves 27.5% accuracy, significantly outperforming the direct prompting baseline (accuracy of 12.5%). With the minimal human input of selecting from LLM-generated candidates, performance is boosted to 37.5% (and we argue this is a lower bound on the performance of our approach without filtering). Our ablation studies show that abstract hypothesis generation and concrete program representations are both beneficial for LLMs performing inductive reasoning tasks.
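
    The propose-filter-implement-verify loop described above can be sketched in a few lines of Python. In this hedged sketch, `propose`, `summarize`, and `implement` are hypothetical LLM wrappers assumed for illustration; they are not names from the paper.

```python
# Hedged sketch of the hypothesis-search pipeline described in the abstract.
def propose(train_pairs, n):
    """Placeholder: ask an LLM for n natural-language hypotheses (list[str])."""
    raise NotImplementedError

def summarize(hypotheses, n):
    """Placeholder: ask the LLM to condense the hypotheses into n candidates."""
    raise NotImplementedError

def implement(hypothesis):
    """Placeholder: ask the LLM for Python source defining transform(x)."""
    raise NotImplementedError

def hypothesis_search(train_pairs, test_inputs, n_hypotheses=64, n_keep=8):
    # 1. Propose many abstract hypotheses, then filter to cut generation cost
    #    (the paper alternatively uses human selection at this step).
    candidates = summarize(propose(train_pairs, n_hypotheses), n_keep)
    # 2. Implement each candidate as a concrete program; a hypothesis is kept
    #    only if its program reproduces every observed training example.
    for hypothesis in candidates:
        namespace = {}
        exec(implement(hypothesis), namespace)   # defines transform(x)
        transform = namespace["transform"]
        if all(transform(x) == y for x, y in train_pairs):
            return [transform(x) for x in test_inputs]
    return None  # no hypothesis survived verification
```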

    There is a large disparity between what people see in social media about health research and the underlying strength of evidence

    Get PDF
    Our social media feeds are full of articles shared by friends and family that make claims about how something can prevent a particular health condition. But how robust is the scientific evidence base underpinning these claims? Noah Haber, Alexander Breskin, Ellen Moscoe and Emily R. Smith, on behalf of the CLAIMS team, report on a systematic review of the state of causal inference in media articles and academic studies at the point of consumption on social media. There is a large disparity between what people see in social media about health research and the underlying strength of evidence, both in the studies themselves and in the media articles describing their findings. The studies tend to imply stronger causal inference than their methods merit, while the media articles reporting on them were found to be further overstated and inaccurate.

    List randomization for eliciting HIV status and sexual behaviors in rural KwaZulu-Natal, South Africa: a randomized experiment using known true values for validation

    Get PDF
    Background: List randomization (LR), a survey method intended to mitigate biases related to sensitive true/false questions, has received recent attention from researchers. However, tests of its validity are limited, with no study comparing LR-elicited results with individually known truths. We conducted a test of LR for HIV-related responses in a high HIV prevalence setting in KwaZulu-Natal. By using researcher-known HIV serostatus and HIV test refusal data, we were able to assess how LR and direct questionnaires perform against individual known truth. Methods: Participants were recruited from the participant list of the 2016 round of the Africa Health Research Institute demographic surveillance system, oversampling individuals who were HIV positive. Participants were randomized to two study arms. In Arm A, participants were presented five true/false statements, one of which was the sensitive item and the others non-sensitive. Participants were then asked how many of the five statements they believed were true. In Arm B, participants were asked about each statement individually. LR estimates used data from both arms, while direct estimates were generated from Arm B alone. We compared elicited responses to HIV testing and serostatus data collected through the demographic surveillance system. Results: We enrolled 483 participants: 262 (54%) were randomly assigned to Arm A, and 221 (46%) to Arm B. LR estimated 56% (95% CI: 40 to 72%) of the population to be HIV-negative, compared to 47% (95% CI: 39 to 54%) using direct estimates; the population estimate of the true value was 32% (95% CI: 28 to 36%). LR estimates yielded HIV test refusal percentages of 55% (95% CI: 37 to 73%), compared to 13% (95% CI: 8 to 17%) by direct estimation and 15% (95% CI: 12 to 18%) based on observed past behavior. Conclusions: In this context, LR performed poorly and did not improve estimates over direct questioning when compared with known truth. These results may reflect difficulties in implementation or comprehension of the LR approach, which is inherently complex. Adjustments to delivery procedures may improve LR’s usefulness. Further investigation of the cognitive processes of participants in answering LR surveys is warranted.
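
    The LR estimator contrasted with direct questioning above is the standard item-count difference-in-means: the mean count reported by Arm A (four non-sensitive items plus the sensitive item) minus the mean number of non-sensitive items endorsed in Arm B. A minimal sketch, using illustrative array inputs rather than the study's data:

```python
# Hedged sketch of the standard list-randomization (item-count) estimator.
import numpy as np

def lr_estimate(arm_a_counts, arm_b_nonsensitive):
    """Prevalence of the sensitive item = mean Arm A count (0..5 per person)
    minus mean number of non-sensitive items endorsed in Arm B (0..4)."""
    arm_a_counts = np.asarray(arm_a_counts, dtype=float)
    arm_b_totals = np.asarray(arm_b_nonsensitive, dtype=float).sum(axis=1)
    est = arm_a_counts.mean() - arm_b_totals.mean()
    # Standard error of a difference in independent means, normal-approx CI.
    se = np.sqrt(arm_a_counts.var(ddof=1) / len(arm_a_counts)
                 + arm_b_totals.var(ddof=1) / len(arm_b_totals))
    return est, (est - 1.96 * se, est + 1.96 * se)

def direct_estimate(arm_b_sensitive):
    """Direct estimate: proportion answering 'true' to the sensitive item."""
    responses = np.asarray(arm_b_sensitive, dtype=float)
    p = responses.mean()
    se = np.sqrt(p * (1 - p) / len(responses))
    return p, (p - 1.96 * se, p + 1.96 * se)
```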

    The worldwide clinical trial research response to the COVID-19 pandemic - the first 100 days

    Get PDF
    Background: Never before have clinical trials drawn as much public attention as those testing interventions for COVID-19. We aimed to describe the worldwide COVID-19 clinical research response and its evolution over the first 100 days of the pandemic. Methods: Descriptive analysis of trials that were planned, ongoing, or completed by April 9, 2020, testing any intervention to treat or prevent COVID-19, systematically identified in trial registries, preprint servers, and literature databases. A survey of all trials was conducted to assess their recruitment status up to July 6, 2020. Results: Most of the 689 trials (overall target sample size 396,366) were small (median sample size 120; interquartile range [IQR] 60-300) but randomized (75.8%; n=522), and were often conducted in China (51.1%; n=352) or the USA (11%; n=76). 525 trials (76.2%) planned to include 155,571 hospitalized patients, and 25 (3.6%) planned to include 96,821 health-care workers. Treatments were evaluated in 607 trials (88.1%), frequently antivirals (n=144) or antimalarials (n=112); 78 trials (11.3%) focused on prevention, including 14 vaccine trials. No trial investigated social distancing. Interventions tested in 11 trials with >5,000 participants were also tested in 169 smaller trials (median sample size 273; IQR 90-700). Hydroxychloroquine alone was investigated in 110 trials. While 414 trials (60.0%) expected completion in 2020, only 35 trials (4.1%; 3,071 participants) were completed by July 6. Of 112 trials with detailed recruitment information, 55 had recruited <20% of the targeted sample, 27 between 20% and 50%, and 30 over 50% (median 14.8% [IQR 2.0-62.0%]). Conclusions: The size and speed of the COVID-19 clinical trial agenda are unprecedented. However, most trials were small and investigated only a small fraction of treatment options. The feasibility of this research agenda is questionable, and many trials may end in futility, wasting research resources. Much better coordination is needed to respond to global health threats.

    Association between convalescent plasma treatment and mortality in COVID-19: a collaborative systematic review and meta-analysis of randomized clinical trials.

    Get PDF
    Funder: Laura and John Arnold Foundation. BACKGROUND: Convalescent plasma has been widely used to treat COVID-19 and is under investigation in numerous randomized clinical trials, but results are publicly available only for a small number of trials. The objective of this study was to assess the effect of convalescent plasma treatment, compared to placebo or no treatment, on all-cause mortality in patients with COVID-19, using data from all available randomized clinical trials, including unpublished and ongoing trials (Open Science Framework, https://doi.org/10.17605/OSF.IO/GEHFX). METHODS: In this collaborative systematic review and meta-analysis, clinical trial registries (ClinicalTrials.gov, WHO International Clinical Trials Registry Platform), the Cochrane COVID-19 register, the LOVE database, and PubMed were searched until April 8, 2021. Investigators of trials registered by March 1, 2021, without published results were contacted via email. Eligible were ongoing, discontinued, and completed randomized clinical trials that compared convalescent plasma with placebo or no treatment in COVID-19 patients, regardless of setting or treatment schedule. Aggregated mortality data were extracted from publications or provided by investigators of unpublished trials and combined using the Hartung-Knapp-Sidik-Jonkman random effects model. We investigated the contribution of unpublished trials to the overall evidence. RESULTS: A total of 16,477 patients were included in 33 trials (20 unpublished with 3,190 patients, 13 published with 13,287 patients). 32 trials enrolled only hospitalized patients (including 3 with only intensive care unit patients). Risk of bias was low for 29/33 trials. Of 8,495 patients who received convalescent plasma, 1,997 died (23%), and of 7,982 control patients, 1,952 died (24%). The combined risk ratio for all-cause mortality was 0.97 (95% confidence interval: 0.92 to 1.02), with between-study heterogeneity not beyond chance (I² = 0%). The RECOVERY trial carried 69.8% of the weight in the meta-analysis and the unpublished evidence 25.3%. CONCLUSIONS: Convalescent plasma treatment of patients with COVID-19 did not reduce all-cause mortality. These results provide strong evidence that convalescent plasma treatment for patients with COVID-19 should not be used outside of randomized trials. Evidence synthesis from collaborations among trial investigators can inform both evidence generation and evidence application in patient care.
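
    For readers unfamiliar with the pooling method named in the Methods, here is a minimal Python sketch of Hartung-Knapp-Sidik-Jonkman random-effects meta-analysis of risk ratios: DerSimonian-Laird between-study variance, random-effects weights, and the Hartung-Knapp variance adjustment with a t-based interval. The function and its inputs are illustrative; this is not the authors' code or the trial data.

```python
# Hedged sketch of HKSJ random-effects pooling of study-level risk ratios.
import numpy as np
from scipy import stats

def hksj_pooled_rr(events_t, n_t, events_c, n_c):
    """Pool log risk ratios across k studies; returns (RR, 95% CI).
    Zero-event arms would need a continuity correction, omitted here."""
    events_t, n_t = np.asarray(events_t, float), np.asarray(n_t, float)
    events_c, n_c = np.asarray(events_c, float), np.asarray(n_c, float)
    y = np.log((events_t / n_t) / (events_c / n_c))     # log risk ratios
    v = 1/events_t - 1/n_t + 1/events_c - 1/n_c         # large-sample variances
    k = len(y)
    # DerSimonian-Laird between-study variance tau^2 from fixed-effect weights.
    w = 1 / v
    mu_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fixed) ** 2)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    # Random-effects pooled estimate.
    w_star = 1 / (v + tau2)
    mu = np.sum(w_star * y) / np.sum(w_star)
    # Hartung-Knapp variance adjustment with a t interval on k-1 df.
    var_hk = np.sum(w_star * (y - mu) ** 2) / ((k - 1) * np.sum(w_star))
    half = stats.t.ppf(0.975, k - 1) * np.sqrt(var_hk)
    return np.exp(mu), (np.exp(mu - half), np.exp(mu + half))
```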