Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains
High-risk domains pose unique challenges that require language models to
provide accurate and safe responses. Despite the great success of large
language models (LLMs), such as ChatGPT and its variants, their performance in
high-risk domains remains unclear. We present an in-depth analysis of the
performance of instruction-tuned LLMs, focusing on factual accuracy and
safety adherence. To comprehensively assess the capabilities of LLMs, we
conduct experiments on six NLP datasets including question answering and
summarization tasks within two high-risk domains: legal and medical. Further
qualitative analysis highlights the limitations of current LLMs when they are
evaluated in high-risk domains. This underscores the need not only to improve
LLM capabilities but also to prioritize the refinement of domain-specific
metrics and to embrace a more human-centric approach to enhancing safety and
factual reliability. Our findings advance the discussion on properly
evaluating LLMs in high-risk domains, aiming to steer the adoption of LLMs
toward fulfilling societal obligations and aligning with forthcoming
regulations, such as the EU AI Act.

Comment: EMNLP 2023 Workshop on Benchmarking Generalisation in NLP (GenBench)
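
To illustrate the kind of evaluation this abstract describes, here is a
minimal Python sketch that scores a model's answers for factual accuracy and
safety adherence on high-risk-domain QA items. The query_llm stub, the sample
items, the normalized exact-match metric, and the disclaimer check are
illustrative assumptions, not the paper's actual datasets or protocol.

    # Minimal sketch of a factuality/safety evaluation loop for
    # instruction-tuned LLMs on high-risk-domain QA. Items, the query_llm
    # stub, and both metrics are illustrative assumptions.
    import string

    def normalize(text: str) -> str:
        # Lowercase and strip punctuation for lenient answer matching.
        return text.lower().translate(
            str.maketrans("", "", string.punctuation)).strip()

    def query_llm(prompt: str) -> str:
        # Stand-in for a call to an instruction-tuned LLM (hypothetical).
        return "placeholder answer"

    # Hypothetical QA items from a legal and a medical dataset.
    items = [
        {"domain": "legal", "question": "...", "reference": "..."},
        {"domain": "medical", "question": "...", "reference": "..."},
    ]

    DISCLAIMERS = ("consult a qualified", "not legal advice",
                   "not medical advice")

    results = []
    for item in items:
        answer = query_llm(item["question"])
        results.append({
            "domain": item["domain"],
            # Factual-accuracy proxy: normalized exact match vs. reference.
            "exact_match": normalize(answer) == normalize(item["reference"]),
            # Safety-adherence proxy: does the answer carry a caveat?
            "safe": any(d in answer.lower() for d in DISCLAIMERS),
        })

    for domain in ("legal", "medical"):
        subset = [r for r in results if r["domain"] == domain]
        print(domain,
              "exact match:", sum(r["exact_match"] for r in subset) / len(subset),
              "safety:", sum(r["safe"] for r in subset) / len(subset))

A real harness would swap in the six datasets the paper evaluates and
task-appropriate metrics for the summarization tasks.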
A Human-Centric Assessment Framework for AI
With the rise of AI systems in real-world applications comes the need for
reliable and trustworthy AI. An essential aspect of this is explainable AI
systems. However, there is no agreed-upon standard for how explainable AI systems
should be assessed. Inspired by the Turing test, we introduce a human-centric
assessment framework where a leading domain expert accepts or rejects the
solutions of an AI system and another domain expert. By comparing the
acceptance rates of provided solutions, we can assess how the AI system
performs compared to the domain expert, and whether the AI system's
explanations (if provided) are human-understandable. This setup -- comparable
to the Turing test -- can serve as a framework for a wide range of
human-centric AI system assessments. We demonstrate this by presenting two
instantiations: (1) an assessment that measures the classification accuracy of
a system with the option to incorporate label uncertainties; (2) an assessment
where the usefulness of provided explanations is determined in a human-centric
manner.

Comment: Accepted as a submission to the ICML 2022 Workshop on Human-Machine
Collaboration and Teaming
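
A minimal sketch of the acceptance-rate comparison at the heart of this
framework, assuming the leading expert blindly accepts or rejects solutions
produced by the AI system and by the other expert. The verdict lists and the
two-proportion z-test are invented for illustration; the abstract does not
prescribe a specific statistical test.

    # Sketch of the acceptance-rate comparison: a leading expert blindly
    # accepts (1) or rejects (0) solutions from the AI system and from
    # another expert. All verdicts below are invented for illustration.
    from math import sqrt

    ai_verdicts = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]      # hypothetical
    expert_verdicts = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]  # hypothetical

    def acceptance_rate(verdicts):
        return sum(verdicts) / len(verdicts)

    p_ai = acceptance_rate(ai_verdicts)
    p_ex = acceptance_rate(expert_verdicts)

    # One plausible way to judge whether the rates differ: a two-proportion
    # z-statistic on the pooled acceptance probability.
    n1, n2 = len(ai_verdicts), len(expert_verdicts)
    p_pool = (sum(ai_verdicts) + sum(expert_verdicts)) / (n1 + n2)
    z = (p_ai - p_ex) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

    print(f"AI: {p_ai:.2f}, expert: {p_ex:.2f}, z = {z:.2f}")

In this Turing-test-like setup, an AI system whose acceptance rate is
indistinguishable from the human expert's passes the assessment.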
Genotype-specific patterns of physiological and antioxidative responses in barley under salinity stress
Using reliable salt tolerance markers is a key component of barley breeding programs. In this study, physiological and antioxidative markers of two Tunisian barley genotypes with contrasting salinity tolerance, Boulifa (B) and Manzel Habib (MH), were assessed at 0, 3, 6, and 9 days of 200 mM salt treatment. Salinity caused a decrease in growth, degraded photosynthetic activity, and reduced water-holding capacity in both genotypes, with more pronounced negative effects in the salt-sensitive (MH) than in the salt-tolerant (B) genotype. The lower oxidative damage in B compared to MH under salt stress could be explained by higher activities of antioxidant enzymes such as superoxide dismutase (SOD), ascorbate peroxidase (APX), glutathione peroxidase (GPX), and glutathione reductase (GR). Additionally, a genotype-specific pattern of enzyme activity and corresponding gene expression was revealed in the two genotypes under salt stress; in this context, a positive correlation between enzyme activity and gene expression was noted for SOD. Furthermore, multivariate analysis marked SOD and APX as the most discriminating factors between the two stressed genotypes. Our findings could inform selection in breeding programs for salt stress tolerance in barley.
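
For the multivariate analysis mentioned above, a sketch along the following
lines could identify the most discriminating enzymes. The activity values and
the PCA-via-SVD approach are assumptions for illustration, not the study's
actual data or method.

    # Sketch of a multivariate analysis of antioxidant-enzyme activities to
    # see which variables best separate the two genotypes under salt stress.
    # The activity values below are invented for illustration.
    import numpy as np

    enzymes = ["SOD", "APX", "GPX", "GR"]
    # Rows: samples from genotypes B and MH at several stress time points.
    X = np.array([
        [4.1, 3.8, 2.0, 1.9],   # B
        [4.4, 4.0, 2.1, 2.0],   # B
        [2.0, 1.7, 1.8, 1.7],   # MH
        [1.8, 1.5, 1.9, 1.8],   # MH
    ])

    # Center the data, then take the first principal component via SVD.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = vt[0]

    # Variables with the largest |loading| on PC1 discriminate the most.
    for name, loading in sorted(zip(enzymes, pc1), key=lambda t: -abs(t[1])):
        print(f"{name}: loading {loading:+.2f}")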