3 research outputs found

    Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains

    Full text available
    High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs) such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study provides an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets, including question answering and summarization tasks, within two high-risk domains: legal and medical. Further qualitative analysis highlights the limitations of current LLMs when they are evaluated in high-risk domains. This underscores the need not only to improve LLM capabilities but also to prioritize the refinement of domain-specific metrics and to embrace a more human-centric approach to enhance safety and factual reliability. Our findings advance the discussion of how to properly evaluate LLMs in high-risk domains, aiming to steer the adaptability of LLMs in fulfilling societal obligations and aligning with forthcoming regulations such as the EU AI Act.
    Comment: EMNLP 2023 Workshop on Benchmarking Generalisation in NLP (GenBench).
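    The evaluation described above can be pictured as a simple scoring loop over a QA dataset. The sketch below is only an illustration of that setup, not the authors' code: `ask_model`, the exact-match metric, and the toy legal-QA item are assumptions standing in for the six datasets and the paper's actual factuality and safety metrics.

```python
# Minimal sketch of a factual-accuracy evaluation loop (illustrative only).
from typing import Callable

def exact_match(prediction: str, reference: str) -> bool:
    """Crude factual-accuracy proxy: normalized exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate_qa(dataset: list[dict], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = sum(
        exact_match(ask_model(ex["question"]), ex["answer"]) for ex in dataset
    )
    return correct / len(dataset)

# Hypothetical usage with a stubbed model and a made-up legal-QA item:
toy_dataset = [{"question": "What does 'habeas corpus' protect against?",
                "answer": "unlawful detention"}]
print(evaluate_qa(toy_dataset, ask_model=lambda q: "unlawful detention"))
```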

    A Human-Centric Assessment Framework for AI

    Full text available
    With the rise of AI systems in real-world applications comes the need for reliable and trustworthy AI. An essential aspect of this is explainable AI systems. However, there is no agreed standard for how explainable AI systems should be assessed. Inspired by the Turing test, we introduce a human-centric assessment framework in which a leading domain expert accepts or rejects the solutions of an AI system and of another domain expert. By comparing the acceptance rates of the provided solutions, we can assess how the AI system performs relative to the domain expert, and whether the AI system's explanations (if provided) are human-understandable. This setup, comparable to the Turing test, can serve as a framework for a wide range of human-centric AI system assessments. We demonstrate this by presenting two instantiations: (1) an assessment that measures the classification accuracy of a system with the option to incorporate label uncertainties; (2) an assessment in which the usefulness of the provided explanations is determined in a human-centric manner.
    Comment: Accepted at the ICML 2022 Workshop on Human-Machine Collaboration and Teaming.
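    The core of the framework is a comparison of two acceptance rates. A minimal sketch of that comparison follows, assuming the leading expert's verdicts are recorded as booleans; the verdict lists are invented for illustration and nothing here is taken from the paper's implementation.

```python
# Minimal sketch of the acceptance-rate comparison (illustrative only).
def acceptance_rate(verdicts: list[bool]) -> float:
    """Fraction of submitted solutions the leading expert accepted."""
    return sum(verdicts) / len(verdicts)

# Hypothetical blind verdicts: True = accepted, False = rejected.
ai_verdicts     = [True, True, False, True, True, False, True, True]
human_verdicts  = [True, True, True, False, True, True, True, True]

print(f"AI: {acceptance_rate(ai_verdicts):.2f}  "
      f"human expert: {acceptance_rate(human_verdicts):.2f}")
# The AI "passes" this Turing-test-style assessment if its acceptance rate
# is not meaningfully below the human expert's; a significance test such as
# a two-proportion z-test could formalize "meaningfully".
```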

    Genotype-specific patterns of physiological and antioxidative responses in barley under salinity stress

    No full text
    Using reliable salt tolerance markers is a key component of barley breeding programs. In this study, physiological and antioxidative markers of two Tunisian barley genotypes contrasting in salinity tolerance, Boulifa (B) and Manzel Habib (MH), were assessed at 0, 3, 6 and 9 days of 200 mM salt treatment. Salinity caused a decrease in growth, degraded photosynthetic activity, and reduced water-holding capacity in both genotypes, with more pronounced negative effects in the salt-sensitive (MH) than in the salt-tolerant (B) genotype. The lower oxidative damage in B compared to MH under salt stress could be explained by the higher activities of antioxidant enzymes such as superoxide dismutase (SOD), ascorbate peroxidase (APX), glutathione peroxidase (GPX) and glutathione reductase (GR). Additionally, a genotype-specific pattern of enzyme activity and corresponding gene expression was revealed in the two genotypes under salt stress; in this context, a positive correlation between activity and expression was noted for SOD. Multivariate analysis marked SOD and APX as the most discriminating factors between the two stressed genotypes. Our findings could inform selection in breeding programs for salt stress tolerance in barley.
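    The multivariate analysis mentioned above is typically a projection such as principal component analysis over the enzyme-activity measurements. The sketch below shows those mechanics on invented numbers, assuming activities are arranged as a samples-by-variables matrix; the values, sample counts, and loadings are placeholders, not the study's data or results.

```python
# Minimal PCA sketch via SVD on made-up enzyme activities (illustrative only).
import numpy as np

# Rows: samples per genotype; columns: SOD, APX, GPX, GR activities.
X = np.array([
    [9.1, 7.8, 4.0, 3.9],   # B (tolerant), fabricated values
    [8.7, 7.5, 4.2, 4.1],   # B
    [5.2, 4.1, 3.8, 3.7],   # MH (sensitive)
    [5.0, 4.3, 3.9, 3.6],   # MH
])
Xc = X - X.mean(axis=0)                      # center each variable
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
loadings = vt[0]                             # first principal component
for name, w in zip(["SOD", "APX", "GPX", "GR"], loadings):
    print(f"{name}: {w:+.2f}")               # large |loading| => discriminating
```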