Debate Helps Supervise Unreliable Experts
As AI systems are used to answer more difficult questions and potentially
help create new knowledge, judging the truthfulness of their outputs becomes
more difficult and more important. How can we supervise unreliable experts,
which have access to the truth but may not accurately report it, to give
answers that are systematically true and don't just superficially seem true,
when the supervisor can't tell the difference between the two on their own? In
this work, we show that debate between two unreliable experts can help a
non-expert judge more reliably identify the truth. We collect a dataset of
human-written debates on hard reading comprehension questions where the judge
has not read the source passage, only ever seeing expert arguments and short
quotes selectively revealed by 'expert' debaters who have access to the
passage. In our debates, one expert argues for the correct answer, and the
other for an incorrect answer. Comparing debate to a baseline we call
consultancy, where a single expert argues for only one answer which is correct
half of the time, we find that debate performs significantly better, with 84%
judge accuracy compared to consultancy's 74%. Debates are also more efficient,
being 68% of the length of consultancies. By comparing human to AI debaters, we
find evidence that with more skilled (in this case, human) debaters, the
performance of debate goes up but the performance of consultancy goes down. Our
error analysis also supports this trend, with 46% of errors in human debate
attributable to mistakes by the honest debater (which should go away with
increased skill); whereas 52% of errors in human consultancy are due to
debaters obfuscating the relevant evidence from the judge (which should become
worse with increased skill). Overall, these results show that debate is a
promising approach for supervising increasingly capable but potentially
unreliable AI systems.
Comment: 84 pages, 13 footnotes, 5 figures, 4 tables, 28 debate transcripts;
data and code at
https://github.com/julianmichael/debate/tree/2023-nyu-experiment
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
We present GPQA, a challenging dataset of 448 multiple-choice questions
written by domain experts in biology, physics, and chemistry. We ensure that
the questions are high-quality and extremely difficult: experts who have or are
pursuing PhDs in the corresponding domains reach 65% accuracy (74% when
discounting clear mistakes the experts identified in retrospect), while highly
skilled non-expert validators only reach 34% accuracy, despite spending on
average over 30 minutes with unrestricted access to the web (i.e., the
questions are "Google-proof"). The questions are also difficult for
state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving
39% accuracy. If we are to use future AI systems to help us answer very hard
questions, for example, when developing new scientific knowledge, we need to
develop scalable oversight methods that enable humans to supervise their
outputs, which may be difficult even if the supervisors are themselves skilled
and knowledgeable. The difficulty of GPQA both for skilled non-experts and
frontier AI systems should enable realistic scalable oversight experiments,
which we hope can help devise ways for human experts to reliably get truthful
information from AI systems that surpass human capabilities.
Comment: 28 pages, 5 figures, 7 tables
Predicting hospital admission and discharge with symptom or function scores in patients with schizophrenia: pooled analysis of a clinical trial extension
Background: The purpose of this analysis was to evaluate relationships between hospital admission or discharge and symptom or functioning scores in patients with schizophrenia. Methods: Data were from three 52-week open-label extensions of the double-blind pivotal trials of paliperidone extended-release (ER). Symptoms and patient function were measured every 4 weeks using the Personal and Social Performance (PSP) scale and the Positive and Negative Syndrome Scale (PANSS). The intent-to-treat analysis set was defined as open-label patients who had at least one post-baseline PSP and PANSS measurement. Time until first hospitalization was evaluated using the Cox proportional hazard model with categorical time-dependent measures for the PSP (1 to 30, 31 to 70, 71 to 100) or PANSS (< 75, ≥ 75 to < 95, ≥ 95), as well as age, gender, schizophrenia duration, and country. Similar analyses were performed for time to discharge. Results: Of the 1,077 enrolled patients, 1,028 (95.5%) met study criteria; of these, 382 (37.2%) were hospitalized at open-label baseline. Compared with the PSP ≥ 71 group, the hazard for new hospitalization was 8.351 times greater (P = 0.0001) for patients with the poorest functioning (PSP 1 to 30) and 1.977 times greater (P = 0.0295) for patients with PSP of 31 to 70. The hazard for new hospitalization was 5.457 times greater (P < 0.0001) for patients with PANSS ≥ 95 and 2.316 times greater (P = 0.0027) for the ≥ 75 to < 95 group compared with the < 75 group. For patients hospitalized at baseline, the PANSS ≥ 95 patients had a discharge hazard that was 0.456 times lower than for the < 75 patients (P < 0.0001). The hazard for discharge was 0.646 times lower (P = 0.0012) for the PANSS ≥ 75 to < 95 group compared with the < 75 group. A patient's country was a significant predictor variable, with US patients being admitted and discharged faster. Conclusions: Better functioning or being less symptomatic is associated with reduced risk for hospitalization and greater chance for early discharge. Treatments or programs that reduce symptoms or improve function decrease the risk of hospitalization in community patients or increase the chance of discharge for hospitalized patients.
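As a reading aid for the hazard figures in this abstract, the Cox proportional hazards model with time-dependent covariates can be written in its standard textbook form. The notation below is an assumption introduced here, not taken from the paper: $h_0(t)$ is the baseline hazard, $x(t)$ the covariate vector at time $t$ (for example, indicators for the current PSP or PANSS category), and $\beta_k$ the coefficient for category $k$.

```latex
h\bigl(t \mid x(t)\bigr) = h_0(t)\,\exp\!\bigl(\beta^{\top} x(t)\bigr),
\qquad
\mathrm{HR}_k = \frac{h(t \mid \text{category } k)}{h(t \mid \text{reference category})} = \exp(\beta_k)
```

Under this form, the reported value of 8.351 for PSP 1 to 30 versus PSP ≥ 71 would correspond to $\exp(\beta_k) = 8.351$. The discharge figures ("0.456 times lower") presumably denote a hazard ratio of 0.456, that is, a discharge hazard roughly 54% lower than in the reference group, although the abstract does not spell this out.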
Cold atmospheric plasma decontamination of SARS-CoV-2 bioaerosols
Bioaerosols (aerosolized particles with biological origin) are strongly suspected to play a significant role in the transmission of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), especially in closed indoor environments. Thus, control technologies capable of effectively inactivating bioaerosols are urgently needed. In this regard, cold atmospheric pressure plasma (CAP) can represent a suitable option, thanks to its ability to produce reactive species, which can exert antimicrobial action. In this study, results on the total inactivation of SARS-CoV-2 contained in bioaerosols treated using CAP generated in air are reported, demonstrating the possible use of CAP systems for the control of SARS-CoV-2 diffusion through bioaerosols.
Diagnosing Clostridioides difficile infections with molecular diagnostics: multicenter evaluation of revogene C. difficile assay
Clostridioides difficile infections are a significant threat to our healthcare system, and rapid and accurate diagnostics are crucial to implement the necessary infection prevention and control measures. Nucleic acid amplification tests are such reliable diagnostic tools for the detection of toxigenic Clostridioides difficile strains directly from stool specimens. In this multicenter evaluation, we determined the performance of the revogene C. difficile assay. The analysis was conducted on prospective stool specimens collected from six different sites in Europe. The performance of the revogene C. difficile assay was compared to the different routine diagnostic methods and, for a subset of the specimens, against toxigenic culture. In total, 2621 valid stool specimens were tested, and the revogene C. difficile assay displayed a sensitivity/specificity of 97.1% [93.3-99.0] and 98.9% [98.5-99.3] for identification of Clostridioides difficile infection. Discrepancy analysis using additional methods improved this performance to 98.8% [95.8-99.9] and 99.6% [99.2-99.8], respectively. In comparison to toxigenic culture, the revogene C. difficile assay displayed a sensitivity/specificity of 93.0% [86.1-97.1] and 99.5% [98.7-99.9], respectively. These results indicate that the revogene C. difficile assay is a robust and reliable aid in the diagnosis of Clostridioides difficile infections. This study was supported by grants from GenePOC, now part of Meridian Biosciences.
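To make the sensitivity/specificity figures with bracketed intervals easier to interpret, here is a minimal Python sketch of how such estimates are typically derived from a 2x2 table. Everything in it is an assumption for illustration: the abstract does not report the underlying counts or the interval method, so the counts are hypothetical and the Wilson score interval is simply one common choice. The function names are also invented here.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (estimate, lower, upper) for sensitivity and specificity."""
    sens = tp / (tp + fn)  # share of true infections the assay detects
    spec = tn / (tn + fp)  # share of non-infections the assay clears
    return {
        "sensitivity": (sens, *wilson_interval(tp, tp + fn)),
        "specificity": (spec, *wilson_interval(tn, tn + fp)),
    }


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study.
    result = sensitivity_specificity(tp=90, fn=10, tn=950, fp=50)
    for name, (est, lo, hi) in result.items():
        print(f"{name}: {est:.1%} [{lo:.1%}-{hi:.1%}]")
```

The interval for sensitivity is usually wider than the one for specificity because positive specimens are far fewer than negative ones, which matches the pattern of the intervals quoted in the abstract.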
Global variations and time trends in the prevalence of childhood myopia, a systematic review and quantitative meta-analysis: implications for aetiology and early prevention.
The aim of this review was to quantify the global variation in childhood myopia prevalence over time taking account of demographic and study design factors. A systematic review identified population-based surveys with estimates of childhood myopia prevalence published by February 2015. Multilevel binomial logistic regression of log odds of myopia was used to examine the association with age, gender, urban versus rural setting and survey year, among populations of different ethnic origins, adjusting for study design factors. 143 published articles (42 countries, 374 349 subjects aged 1-18 years, 74 847 myopia cases) were included. Increase in myopia prevalence with age varied by ethnicity. East Asians showed the highest prevalence, reaching 69% (95% credible intervals (CrI) 61% to 77%) at 15 years of age (86% among Singaporean-Chinese). Blacks in Africa had the lowest prevalence; 5.5% at 15 years (95% CrI 3% to 9%). Time trends in myopia prevalence over the last decade were small in whites, increased by 23% in East Asians, with a weaker increase among South Asians. Children from urban environments have 2.6 times the odds of myopia compared with those from rural environments. In whites and East Asians sex differences emerge at about 9 years of age; by late adolescence girls are twice as likely as boys to be myopic. Marked ethnic differences in age-specific prevalence of myopia exist. Rapid increases in myopia prevalence over time, particularly in East Asians, combined with a universally higher risk of myopia in urban settings, suggest that environmental factors play an important role in myopia development, which may offer scope for prevention
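Because "2.6 times the odds" is easy to misread as "2.6 times the prevalence", a short Python sketch of the odds-to-probability arithmetic may help. It is illustrative only: the baseline prevalence is hypothetical, the function name is invented here, and the review's odds ratio comes from an adjusted multilevel model, so this conversion is not a reproduction of its analysis.

```python
def apply_odds_ratio(baseline_prevalence, odds_ratio):
    """Prevalence implied by applying an odds ratio to a baseline prevalence."""
    if not 0 < baseline_prevalence < 1:
        raise ValueError("baseline_prevalence must be strictly between 0 and 1")
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    baseline_odds = baseline_prevalence / (1 - baseline_prevalence)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


if __name__ == "__main__":
    rural = 0.10  # hypothetical rural prevalence, not from the review
    urban = apply_odds_ratio(rural, 2.6)
    print(f"rural {rural:.1%} -> urban {urban:.1%}")
```

With a 10% baseline the implied prevalence is about 22%, noticeably less than 2.6 times the baseline; the gap between odds ratio and prevalence ratio widens as the baseline prevalence rises.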