63 research outputs found
"Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction
Despite the proliferation of explainable AI (XAI) methods, little is
understood about end-users' explainability needs. This gap is critical, because
end-users may have needs that XAI methods should but don't yet support. To
address this gap and contribute to understanding how explainability can support
human-AI interaction, we conducted a study of a real-world AI application via
interviews with 20 end-users of Merlin, a bird-identification app. We found
that people express a need for practically useful information that can improve
their collaboration with the AI system, and intend to use XAI explanations for
calibrating trust, improving their task skills, changing their behavior to
supply better inputs to the AI system, and giving constructive feedback to
developers. We also assessed end-users' perceptions of existing XAI approaches,
finding that they prefer part-based explanations. Finally, we discuss
implications of our findings and provide recommendations for future designs of
XAI, specifically XAI for human-AI collaboration.
Intra-household evaluations of alcohol abuse in men with depression and suicide in women: A cross-sectional community-based study in Chennai, India.
BACKGROUND: Harmful effects of alcohol abuse are well documented for drinkers, and adverse effects are also reported for the physical and emotional well-being of family members, with evidence often originating from either drinkers or their families in clinic-based settings. This study evaluates intra-household associations between alcohol abuse in men, and depression and suicidal attempts in women, in community-based settings of Chennai, India.
METHODS: This community-based cross-sectional study of chronic disease risk factors and outcomes was conducted in n = 259 households and n = 1053 adults (aged 15 years and above) in rural and urban Chennai. The Alcohol Use Disorder Identification Test (AUDIT) score was used to classify alcohol consumption into 'low-risk', 'harmful', 'hazardous' and 'alcohol dependence' drinking and the Patient Health Questionnaire (PHQ-9) score to classify depression as 'mild', 'moderate', 'moderate-severe' and 'severe'. Multivariate logistic regression models estimated the association of depression in women with men's drinking patterns in the same household.
RESULTS: A significant 2.5-fold increase in any depression (PHQ-9 ≥ 5) was observed in men who were 'alcohol-dependent' compared to non-drinkers (OR = 2.53; 95% CI: 1.26, 5.09). However, there was no association between men's drinking behavior and depression in women of the same household, although suicidal attempts approached a significant dose-response relationship with increasing hazard-level of men's drinking (p = 0.08).
CONCLUSION: No significant intra-household association was observed between men's alcohol consumption and women's depression, though an increasing (non-significant) trend was associated with suicidal attempts. Complex relationships between suicidal attempts and depression in women and male abusive drinking require further exploration, with an emphasis on intra-household mechanisms and pathways.
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Instruction tuning large language models (LLMs) remains a challenging task,
owing to the complexity of hyperparameter selection and the difficulty involved
in evaluating the tuned models. To determine the optimal hyperparameters, an
automatic, robust, and reliable evaluation benchmark is essential. However,
establishing such a benchmark is not a trivial task due to the challenges
associated with evaluation accuracy and privacy protection. In response to
these challenges, we introduce a judge large language model, named PandaLM,
which is trained to distinguish the superior model given several LLMs.
PandaLM's focus extends beyond just the objective correctness of responses,
which is the main focus of traditional evaluation datasets. It addresses vital
subjective factors such as relative conciseness, clarity, adherence to
instructions, comprehensiveness, and formality. To ensure the reliability of
PandaLM, we collect a diverse human-annotated test dataset, where all contexts
are generated by humans and labels are aligned with human preferences. Our
results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation
ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM
makes LLM evaluation fairer and less costly, as evidenced by the
significant improvements achieved by models tuned through PandaLM compared to
their counterparts trained with Alpaca's default hyperparameters. In addition,
PandaLM does not depend on API-based evaluations, thus avoiding potential data
leakage. All resources of PandaLM are released at
https://github.com/WeOpenML/PandaLM
Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses?
We evaluated the capability of generative pre-trained transformers (GPT) to
pass assessments in introductory and intermediate Python programming courses at
the postsecondary level. Discussions of potential uses (e.g., exercise
generation, code explanation) and misuses (e.g., cheating) of this emerging
technology in programming education have intensified, but to date there has not
been a rigorous analysis of the models' capabilities in the realistic context
of a full-fledged programming course with a diverse set of assessment
instruments. We evaluated GPT on three Python courses that employ assessments
ranging from simple multiple-choice questions (no code involved) to complex
programming projects with code bases distributed into multiple files (599
exercises overall). Further, we studied if and how successfully GPT models
leverage feedback provided by an auto-grader. We found that the current models
are not capable of passing the full spectrum of assessments typically involved
in a Python programming course (<70% on even entry-level modules). Yet, it is
clear that a straightforward application of these easily accessible models
could enable a learner to obtain a non-trivial portion of the overall available
score (>55%) in introductory and intermediate courses alike. While the models
exhibit remarkable capabilities, including correcting solutions based on
auto-grader's feedback, some limitations exist (e.g., poor handling of
exercises requiring complex chains of reasoning steps). These findings can be
leveraged by instructors wishing to adapt their assessments so that GPT becomes
a valuable assistant for a learner as opposed to an end-to-end solution.
Comment: 7 pages. arXiv admin note: text overlap with arXiv:2303.0803
Centre Commissioned External Review (CCER) of the IWMI-TATA Water Policy Research Program
Agricultural research / Research projects / Project appraisal / Financing / Institutional development / Evaluation / Water policy / Water management / Irrigation management / Groundwater