77 research outputs found

    DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks

    Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns are raised about potential data contamination in their considerable volume of training corpus. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a general and flexible protocol for dynamic evaluation of LLMs. Based on our framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse in DyVal-generated evaluation samples with different complexities, highlighting the significance of dynamic evaluation. We also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples are not only evaluation sets, but also helpful data for fine-tuning to improve the performance of LLMs on existing benchmarks. We hope that DyVal can shed light on future evaluation research of LLMs. Code is available at: https://github.com/microsoft/promptbench. Comment: ICLR 2024 spotlight; 38 pages; code is at aka.ms/dyva
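
The idea of generating evaluation samples from a directed acyclic graph with controllable complexity can be illustrated with a minimal sketch. This is not the DyVal implementation: the function name `generate_dag_sample`, its parameters, and the wording of the generated question are all hypothetical, and only arithmetic nodes are covered. Complexity is controlled here by the number of leaf and internal nodes.

```python
import random


def generate_dag_sample(num_leaves=3, num_internal=3, seed=0):
    """Build a random arithmetic DAG and return (question_text, answer).

    Hypothetical helper for illustration; not part of any published API.
    """
    rng = random.Random(seed)
    values = {}  # node name -> computed value
    lines = []   # one natural-language sentence per node

    # Leaf nodes hold random integer constants.
    for i in range(num_leaves):
        name = f"x{i}"
        values[name] = rng.randint(1, 9)
        lines.append(f"The value of {name} is {values[name]}.")

    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }

    # Each internal node combines two previously defined nodes, so every edge
    # points backwards and the graph is acyclic by construction.
    for j in range(num_internal):
        name = f"y{j}"
        left, right = rng.sample(sorted(values), 2)
        op = rng.choice(sorted(ops))
        values[name] = ops[op](values[left], values[right])
        lines.append(f"{name} equals {left} {op} {right}.")

    root = f"y{num_internal - 1}"
    question = " ".join(lines) + f" What is the value of {root}?"
    return question, values[root]


if __name__ == "__main__":
    question, answer = generate_dag_sample(num_leaves=3, num_internal=4, seed=42)
    print(question)
    print("Ground truth:", answer)
```

Because each sample is drawn fresh from a seed, a new seed yields a new question with a known ground-truth answer, which is the property that makes this style of evaluation resistant to memorized test sets.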

    PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

    The increasing reliance on Large Language Models (LLMs) across academia and industry necessitates a comprehensive understanding of their robustness to prompts. In response to this vital need, we introduce PromptBench, a robustness benchmark designed to measure LLMs' resilience to adversarial prompts. This study uses a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic. These prompts are then employed in diverse tasks, such as sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving. Our study generates 4,032 adversarial prompts, meticulously evaluated over 8 tasks and 13 datasets, with 567,084 test samples in total. Our findings demonstrate that contemporary LLMs are vulnerable to adversarial prompts. Furthermore, we present comprehensive analysis to understand the mystery behind prompt robustness and its transferability. We then offer insightful robustness analysis and pragmatic recommendations for prompt composition, beneficial to both researchers and everyday users. We make our code, prompts, and methodologies to generate adversarial prompts publicly accessible, thereby enabling and encouraging collaborative exploration in this pivotal field: https://github.com/microsoft/promptbench. Comment: Technical report; 23 pages; code is at: https://github.com/microsoft/promptbenc
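
To make the character-level category concrete, the sketch below applies random typo-style perturbations to a prompt by swapping two adjacent interior characters of some words. It is only an illustration under stated assumptions: the function `perturb_characters` and its `rate` parameter are hypothetical, the perturbation is random rather than model-guided, and it does not reproduce any of the attacks used in the benchmark.

```python
import random


def perturb_characters(prompt, rate=0.2, seed=0):
    """Swap two adjacent interior characters in a random subset of words.

    Hypothetical helper: a random typo generator, not an optimized attack.
    """
    rng = random.Random(seed)
    perturbed = []
    for word in prompt.split(" "):
        # Only touch words long enough to keep first and last characters fixed.
        if len(word) >= 4 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        perturbed.append(word)
    return " ".join(perturbed)


if __name__ == "__main__":
    original = "Classify the sentiment of the following review as positive or negative"
    print(perturb_characters(original, rate=0.5, seed=7))
```

A robustness check would then compare task accuracy under the original and the perturbed prompt; a real adversarial attack would additionally search for the perturbation that hurts the model most.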

    Fine mapping and candidate gene analysis of gynoecy trait in chieh-qua (Benincasa hispida Cogn. var. chieh-qua How)

    Gynoecy demonstrates an earlier production of hybrids and a higher yield and improves the efficiency of hybrid seed production. Therefore, the utilization of gynoecy is beneficial for the genetic breeding of chieh-qua. However, little knowledge of gynoecious-related genes in chieh-qua has been reported until now. Here, we used an F2 population from the cross between the gynoecious line ‘A36’ and the monoecious line ‘SX’ for genetic mapping and revealed that chieh-qua gynoecy was regulated by a single recessive gene. We fine-mapped it into a 530-kb region flanked by the markers Indel-3 and KASP145 on Chr.8, which harbors eight candidate genes. One of the candidate genes, Bhi08G000345, encoding networked protein 4 (CqNET4), contained a non-synonymous SNP resulting in the amino acid substitution of isoleucine (ATA; I) to methionine (ATG; M). CqNET4 was prominently expressed in the female flower, and only three genes related to ethylene synthesis were significantly expressed between ‘A36’ and ‘SX.’ The results presented here provide support for the CqNET4 as the most likely candidate gene for chieh-qua gynoecy, which differed from the reported gynoecious genes

    Modulatory Effect of Fermented Papaya Extracts on Mammary Gland Hyperplasia Induced by Estrogen and Progestin in Female Rats

    Fermented papaya extracts (FPEs) are obtained by fermentation of papaya by Aspergillus oryzae and yeasts. In this study, we investigated the protective effects of FPEs on mammary gland hyperplasia induced by estrogen and progestogen. Rats were randomly divided into 6 groups, including a control group, an FPE-alone group, a model group, and three FPE treatment groups (each receiving 30, 15, or 5 ml/kg FPEs). Severe mammary gland hyperplasia was induced upon estradiol benzoate and progestin administration. FPEs could improve the pathological features of the animal model and reduce estrogen levels in the serum. Analysis of oxidant indices revealed that FPEs could increase superoxide dismutase (SOD) and glutathione peroxidase (GSH-Px) activities, decrease malondialdehyde (MDA) level in the mammary glands and serum of the animal models, and decrease the proportion of cells positive for the oxidative DNA damage marker 8-oxo-dG in the mammary glands. Additionally, estradiol benzoate and progestin altered the levels of serum biochemical compounds such as aspartate transaminase (AST), total bilirubin (TBIL), and alanine transaminase (ALT), as well as hepatic oxidant indices such as SOD, GSH-Px, MDA, and 8-oxo-2′-deoxyguanosine (8-oxo-dG). These indices reverted to normal levels upon oral administration of a high dose of FPEs. Taken together, our results indicate that FPEs can protect the mammary glands and other visceral organs from oxidative damage

    The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies

    Reproducibility is a fundamental requirement in scientific experiments and clinical contexts. Recent publications raise concerns about the reliability of microarray technology because of the apparent lack of agreement between lists of differentially expressed genes (DEGs). In this study we demonstrate that (1) such discordance may stem from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion, the lists become much more reproducible, especially when fewer genes are selected; and (3) the instability of short DEG lists based on P cutoffs is an expected mathematical consequence of the high variability of the t-values. We recommend the use of FC ranking plus a non-stringent P cutoff as a baseline practice in order to generate more reproducible DEG lists. The FC criterion enhances reproducibility while the P criterion balances sensitivity and specificity
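
The recommended selection rule (rank by fold change, then filter with a non-stringent P cutoff) can be sketched as follows. The function name `select_degs`, the record layout, the example values, and the default cutoff of 0.05 are assumptions made for illustration; the abstract does not specify a numeric cutoff.

```python
def select_degs(records, n_genes, p_cutoff=0.05):
    """Return the top n_genes gene IDs ranked by absolute log2 fold change.

    records: iterable of (gene_id, log2_fold_change, p_value) tuples.
    p_cutoff: assumed non-stringent threshold, used only as a filter.
    """
    # Step 1: the P value acts as a loose filter, not as the ranking key.
    passing = [r for r in records if r[2] < p_cutoff]
    # Step 2: rank the survivors by the magnitude of the fold change.
    ranked = sorted(passing, key=lambda r: abs(r[1]), reverse=True)
    return [gene_id for gene_id, _, _ in ranked[:n_genes]]


if __name__ == "__main__":
    # Hypothetical example data, not taken from any study.
    example = [
        ("g1", 2.5, 0.04),
        ("g2", 0.3, 0.0001),
        ("g3", -3.1, 0.20),
        ("g4", -1.8, 0.01),
        ("g5", 1.2, 0.03),
    ]
    print(select_degs(example, n_genes=2))  # ['g1', 'g4']
```

In this toy input, ranking by P alone would put `g2` first despite its small fold change, while `g3` is excluded because it fails even the loose P filter.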

    Cross-platform comparability of microarray technology: Intra-platform consistency and appropriate data analysis procedures are essential

    BACKGROUND: The acceptance of microarray technology in regulatory decision-making is being challenged by the existence of various platforms and data analysis methods. A recent report (E. Marshall, Science, 306, 630–631, 2004), by extensively citing the study of Tan et al. (Nucleic Acids Res., 31, 5676–5684, 2003), portrays a disturbingly negative picture of the cross-platform comparability, and, hence, the reliability of microarray technology. RESULTS: We reanalyzed Tan's dataset and found that the intra-platform consistency was low, indicating a problem in experimental procedures from which the dataset was generated. Furthermore, by using three gene selection methods (i.e., p-value ranking, fold-change ranking, and Significance Analysis of Microarrays (SAM)) on the same dataset we found that p-value ranking (the method emphasized by Tan et al.) results in much lower cross-platform concordance compared to fold-change ranking or SAM. Therefore, the low cross-platform concordance reported in Tan's study appears to be mainly due to a combination of low intra-platform consistency and a poor choice of data analysis procedures, instead of inherent technical differences among different platforms, as suggested by Tan et al. and Marshall. CONCLUSION: Our results illustrate the importance of establishing calibrated RNA samples and reference datasets to objectively assess the performance of different microarray platforms and the proficiency of individual laboratories as well as the merits of various data analysis procedures. Thus, we are progressively coordinating the MAQC project, a community-wide effort for microarray quality control

    Microarray scanner calibration curves: characteristics and implications

    BACKGROUND: Microarray-based measurement of mRNA abundance assumes a linear relationship between the fluorescence intensity and the dye concentration. In reality, however, the calibration curve can be nonlinear. RESULTS: By scanning a microarray scanner calibration slide containing known concentrations of fluorescent dyes under 18 PMT gains, we were able to evaluate the differences in calibration characteristics of Cy5 and Cy3. First, the calibration curve for the same dye under the same PMT gain is nonlinear at both the high and low intensity ends. Second, the degree of nonlinearity of the calibration curve depends on the PMT gain. Third, the two PMTs (for Cy5 and Cy3) behave differently even under the same gain. Fourth, the background intensity for the Cy3 channel is higher than that for the Cy5 channel. The impact of such characteristics on the accuracy and reproducibility of measured mRNA abundance and the calculated ratios was demonstrated. Combined with simulation results, we provided explanations to the existence of ratio underestimation, intensity-dependence of ratio bias, and anti-correlation of ratios in dye-swap replicates. We further demonstrated that although Lowess normalization effectively eliminates the intensity-dependence of ratio bias, the systematic deviation from true ratios largely remained. A method of calculating ratios based on concentrations estimated from the calibration curves was proposed for correcting ratio bias. CONCLUSION: It is preferable to scan microarray slides at fixed, optimal gain settings under which the linearity between concentration and intensity is maximized. Although normalization methods improve reproducibility of microarray measurements, they appear less effective in improving accuracy
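
Calculating ratios from concentrations estimated via calibration curves, rather than from raw intensities, can be sketched as below. The abstract does not say how the curves were modeled, so this sketch assumes simple piecewise-linear interpolation between calibration points; the function names and the calibration values are hypothetical.

```python
from bisect import bisect_left


def intensity_to_concentration(intensity, curve):
    """Invert a calibration curve by piecewise-linear interpolation.

    curve: list of (concentration, intensity) pairs with strictly
    increasing intensity. Values outside the curve are clamped.
    """
    concentrations = [c for c, _ in curve]
    intensities = [i for _, i in curve]
    if intensity <= intensities[0]:
        return concentrations[0]
    if intensity >= intensities[-1]:
        return concentrations[-1]
    k = bisect_left(intensities, intensity)
    span = intensities[k] - intensities[k - 1]
    frac = (intensity - intensities[k - 1]) / span
    return concentrations[k - 1] + frac * (concentrations[k] - concentrations[k - 1])


def calibrated_ratio(cy5_intensity, cy3_intensity, cy5_curve, cy3_curve):
    """Ratio of estimated concentrations instead of raw intensities."""
    cy5 = intensity_to_concentration(cy5_intensity, cy5_curve)
    cy3 = intensity_to_concentration(cy3_intensity, cy3_curve)
    return cy5 / cy3


if __name__ == "__main__":
    # Hypothetical nonlinear calibration points: (concentration, intensity).
    curve = [(1, 100), (10, 900), (100, 6000), (1000, 30000)]
    print("raw intensity ratio:", 6000 / 900)
    print("calibrated ratio:", calibrated_ratio(6000, 900, curve, curve))
```

With these made-up points the raw intensity ratio is about 6.7 while the concentration ratio is 10, which mirrors the ratio underestimation described in the abstract.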

    The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies

    BACKGROUND: Reproducibility is a fundamental requirement in scientific experiments. Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. Meanwhile, new statistical methods for identifying DEGs continue to appear in the scientific literature. The resultant variety of existing and emerging methods exacerbates confusion and continuing debate in the microarray community on the appropriate choice of methods for identifying reliable DEG lists. RESULTS: Using the data sets generated by the MicroArray Quality Control (MAQC) project, we investigated the impact on the reproducibility of DEG lists of a few widely used gene selection procedures. We present comprehensive results from inter-site comparisons using the same microarray platform, cross-platform comparisons using multiple microarray platforms, and comparisons between microarray results and those from TaqMan – the widely regarded "standard" gene expression platform. Our results demonstrate that (1) previously reported discordance between DEG lists could simply result from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion with a non-stringent P-value cutoff filtering, the DEG lists become much more reproducible, especially when fewer genes are selected as differentially expressed, as is the case in most microarray studies; and (3) the instability of short DEG lists solely based on P-value ranking is an expected mathematical consequence of the high variability of the t-values; the more stringent the P-value threshold, the less reproducible the DEG list is. These observations are also consistent with results from extensive simulation calculations. CONCLUSION: We recommend the use of FC-ranking plus a non-stringent P cutoff as a straightforward and baseline practice in order to generate more reproducible DEG lists. Specifically, the P-value cutoff should not be stringent (too small) and FC should be as large as possible. Our results provide practical guidance to choose the appropriate FC and P-value cutoffs when selecting a given number of DEGs. The FC criterion enhances reproducibility, whereas the P criterion balances sensitivity and specificity.

    Methylprednisolone as Adjunct to Endovascular Thrombectomy for Large-Vessel Occlusion Stroke

    Importance: It is uncertain whether intravenous methylprednisolone improves outcomes for patients with acute ischemic stroke due to large-vessel occlusion (LVO) undergoing endovascular thrombectomy. Objective: To assess the efficacy and adverse events of adjunctive intravenous low-dose methylprednisolone to endovascular thrombectomy for acute ischemic stroke secondary to LVO. Design, Setting, and Participants: This investigator-initiated, randomized, double-blind, placebo-controlled trial was implemented at 82 hospitals in China, enrolling 1680 patients with stroke and proximal intracranial LVO presenting within 24 hours of time last known to be well. Recruitment took place between February 9, 2022, and June 30, 2023, with a final follow-up on September 30, 2023. Interventions: Eligible patients were randomly assigned to intravenous methylprednisolone (n = 839) at 2 mg/kg/d or placebo (n = 841) for 3 days adjunctive to endovascular thrombectomy. Main Outcomes and Measures: The primary efficacy outcome was disability level at 90 days as measured by the overall distribution of the modified Rankin Scale scores (range, 0 [no symptoms] to 6 [death]). The primary safety outcomes included mortality at 90 days and the incidence of symptomatic intracranial hemorrhage within 48 hours. Results: Among 1680 patients randomized (median age, 69 years; 727 female [43.3%]), 1673 (99.6%) completed the trial. The median 90-day modified Rankin Scale score was 3 (IQR, 1-5) in the methylprednisolone group vs 3 (IQR, 1-6) in the placebo group (adjusted generalized odds ratio for a lower level of disability, 1.10 [95% CI, 0.96-1.25]; P = .17). In the methylprednisolone group, there was a lower mortality rate (23.2% vs 28.5%; adjusted risk ratio, 0.84 [95% CI, 0.71-0.98]; P = .03) and a lower rate of symptomatic intracranial hemorrhage (8.6% vs 11.7%; adjusted risk ratio, 0.74 [95% CI, 0.55-0.99]; P = .04) compared with placebo. Conclusions and Relevance: Among patients with acute ischemic stroke due to LVO undergoing endovascular thrombectomy, adjunctive methylprednisolone added to endovascular thrombectomy did not significantly improve the degree of overall disability. Trial Registration: ChiCTR.org.cn Identifier: ChiCTR210005172