63 research outputs found

    "Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction

    Despite the proliferation of explainable AI (XAI) methods, little is understood about end-users' explainability needs. This gap is critical, because end-users may have needs that XAI methods should but do not yet support. To address this gap and contribute to understanding how explainability can support human-AI interaction, we conducted a study of a real-world AI application via interviews with 20 end-users of Merlin, a bird-identification app. We found that people express a need for practically useful information that can improve their collaboration with the AI system, and intend to use XAI explanations for calibrating trust, improving their task skills, changing their behavior to supply better inputs to the AI system, and giving constructive feedback to developers. We also assessed end-users' perceptions of existing XAI approaches, finding that they prefer part-based explanations. Finally, we discuss the implications of our findings and provide recommendations for future designs of XAI, specifically XAI for human-AI collaboration.

    Intra-household evaluations of alcohol abuse in men with depression and suicide in women: A cross-sectional community-based study in Chennai, India.

    BACKGROUND: Harmful effects of alcohol abuse are well documented for drinkers, and adverse effects are also reported for the physical and emotional well-being of family members, with evidence often originating from either drinkers or their families in clinic-based settings. This study evaluates intra-household associations between alcohol abuse in men and depression and suicidal attempts in women in community-based settings of Chennai, India. METHODS: This community-based cross-sectional study of chronic disease risk factors and outcomes was conducted in n = 259 households and n = 1053 adults (aged 15 years and above) in rural and urban Chennai. The Alcohol Use Disorder Identification Test (AUDIT) score was used to classify alcohol consumption into 'low-risk', 'harmful', 'hazardous' and 'alcohol dependence' drinking, and the Patient Health Questionnaire (PHQ-9) score was used to classify depression as 'mild', 'moderate', 'moderate-severe' and 'severe'. Multivariate logistic regression models estimated the association of depression in women with men's drinking patterns in the same household. RESULTS: A significant 2.5-fold increase in any depression (PHQ-9 ≥ 5) was observed in men who were 'alcohol-dependent' compared to non-drinkers (OR = 2.53; 95% CI: 1.26, 5.09). However, there was no association between men's drinking behavior and depression in women of the same household, although suicidal attempts approached a significant dose-response relationship with increasing hazard level of men's drinking (p = 0.08). CONCLUSION: No significant intra-household association was observed between men's alcohol consumption and women's depression, though an increasing (non-significant) trend was associated with suicidal attempts. Complex relationships between suicidal attempts and depression in women and male abusive drinking require further exploration, with an emphasis on intra-household mechanisms and pathways.
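    The multivariate logistic regression described in this abstract can be sketched roughly as follows. This is a minimal illustration only: the file name, column names (depressed, male_drinking, age, urban), and covariates are assumptions, not the study's actual variables or model specification.

    ```python
    # Minimal sketch of an intra-household logistic regression: women's
    # depression status (PHQ-9 >= 5) regressed on the drinking category of
    # men in the same household. All names below are illustrative assumptions.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("chennai_households.csv")  # hypothetical analysis file

    # Non-drinkers serve as the reference category for the exposure.
    model = smf.logit(
        "depressed ~ C(male_drinking, Treatment('non-drinker')) + age + C(urban)",
        data=df,
    ).fit()

    # Odds ratios with 95% confidence intervals, as reported in the abstract.
    summary = pd.concat([np.exp(model.params), np.exp(model.conf_int())], axis=1)
    summary.columns = ["OR", "2.5%", "97.5%"]
    print(summary)
    ```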

    PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

    Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to identify the superior model among several candidate LLMs. PandaLM's focus extends beyond the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM makes LLM evaluation fairer and less costly, as evidenced by the significant improvements achieved by models tuned with PandaLM compared to counterparts trained with Alpaca's default hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage. All resources of PandaLM are released at https://github.com/WeOpenML/PandaLM.
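    As a rough illustration of pairwise judgment with a model like PandaLM, the sketch below loads a judge model with Hugging Face transformers and asks it to compare two responses to the same instruction. The model ID, prompt template, and decoding settings are assumptions made for illustration; the linked repository documents the actual interface.

    ```python
    # Illustrative sketch of pairwise LLM evaluation with a judge model.
    # Model ID and prompt format are assumptions; see
    # https://github.com/WeOpenML/PandaLM for the project's actual usage.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "WeOpenML/PandaLM-7B-v1"  # assumed Hugging Face model ID
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    def judge(instruction: str, response_1: str, response_2: str) -> str:
        """Ask the judge model which response better follows the instruction."""
        prompt = (
            "Below is an instruction and two candidate responses.\n"
            f"Instruction: {instruction}\n"
            f"Response 1: {response_1}\n"
            f"Response 2: {response_2}\n"
            "Which response is better? Answer '1', '2', or 'Tie'.\n"
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
        # Return only the newly generated judgment text.
        return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True).strip()

    print(judge("Summarize gradient descent in one sentence.",
                "It iteratively updates parameters in the direction that reduces the loss.",
                "Gradient descent is an optimizer."))
    ```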

    Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses?

    We evaluated the capability of generative pre-trained transformers (GPT) to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. Discussions of potential uses (e.g., exercise generation, code explanation) and misuses (e.g., cheating) of this emerging technology in programming education have intensified, but to date there has not been a rigorous analysis of the models' capabilities in the realistic context of a full-fledged programming course with a diverse set of assessment instruments. We evaluated GPT on three Python courses that employ assessments ranging from simple multiple-choice questions (no code involved) to complex programming projects with code bases distributed across multiple files (599 exercises overall). Further, we studied if and how successfully GPT models leverage feedback provided by an auto-grader. We found that the current models are not capable of passing the full spectrum of assessments typically involved in a Python programming course (<70% even on entry-level modules). Yet, it is clear that a straightforward application of these easily accessible models could enable a learner to obtain a non-trivial portion of the overall available score (>55%) in introductory and intermediate courses alike. While the models exhibit remarkable capabilities, including correcting solutions based on the auto-grader's feedback, some limitations exist (e.g., poor handling of exercises requiring complex chains of reasoning steps). These findings can be leveraged by instructors wishing to adapt their assessments so that GPT becomes a valuable assistant for a learner as opposed to an end-to-end solution.
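    The generate-grade-retry loop studied in this abstract can be sketched roughly as below. The model name, the run_autograder helper, and the retry budget are assumptions for illustration, not the paper's actual evaluation harness.

    ```python
    # Rough sketch of a generate -> auto-grade -> retry loop, assuming an
    # OpenAI-compatible chat API. run_autograder is a hypothetical callable
    # returning (passed, feedback_text) for a submitted solution.
    from openai import OpenAI

    client = OpenAI()

    def solve_with_feedback(exercise: str, run_autograder, max_attempts: int = 3) -> str:
        messages = [{"role": "user", "content": f"Write Python code for:\n{exercise}"}]
        code = ""
        for _ in range(max_attempts):
            reply = client.chat.completions.create(model="gpt-4", messages=messages)
            code = reply.choices[0].message.content
            passed, feedback = run_autograder(code)
            if passed:
                break
            # Feed the auto-grader's error output back to the model and retry.
            messages.append({"role": "assistant", "content": code})
            messages.append({"role": "user",
                             "content": f"The auto-grader reported:\n{feedback}\nPlease fix the code."})
        return code
    ```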

    Centre Commissioned External Review (CCER) of the IWMI-TATA Water Policy Research Program

    Agricultural research / Research projects / Project appraisal / Financing / Institutional development / Evaluation / Water policy / Water management / Irrigation management / Groundwater