16 research outputs found

    Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?

    Learning algorithms become more powerful, often at the cost of increased complexity. In response, the demand for algorithms to be transparent is growing. In NLP tasks, attention distributions learned by attention-based deep learning models are used to gain insight into the models' behavior. To what extent is this perspective valid for all NLP tasks? We investigate whether distributions calculated by different attention heads in a transformer architecture can be used to improve transparency in the task of abstractive summarization. To this end, we present both a qualitative and a quantitative analysis to investigate the behavior of the attention heads. We show that some attention heads indeed specialize towards syntactically and semantically distinct input. We propose an approach to evaluate to what extent the Transformer model relies on specifically learned attention distributions. We also discuss what this implies for using attention distributions as a means of transparency. Comment: To appear at FACTS-IR 2019, SIGIR.
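    The kind of head-level inspection described in this abstract can be approximated with off-the-shelf tooling. The sketch below is a minimal illustration, not the authors' method: it assumes the Hugging Face transformers library, the facebook/bart-large-cnn checkpoint, and an entropy heuristic as a crude proxy for head "specialization"; all of these choices are assumptions made for the example.

```python
# Minimal sketch (not the paper's code): extract per-head encoder attention
# distributions from a pretrained summarization model and summarize each head
# by the entropy of its attention, a rough proxy for specialization.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/bart-large-cnn"  # illustrative checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "The quick brown fox jumps over the lazy dog near the river bank."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# encoder_attentions: one tensor per layer, shaped (batch, heads, src_len, src_len)
for layer_idx, layer_attn in enumerate(outputs.encoder_attentions):
    probs = layer_attn[0]  # (num_heads, src_len, src_len)
    # Low entropy means a head concentrates its attention on few tokens.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean(-1)
    print(f"layer {layer_idx}: per-head mean attention entropy {entropy.tolist()}")
```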

    Why Does My Model Fail? Contrastive Local Explanations for Retail Forecasting

    In various business settings, there is an interest in using more complex machine learning techniques for sales forecasting. It is difficult to convince analysts, along with their superiors, to adopt these techniques since the models are considered to be "black boxes," even if they perform better than current models in use. We examine the impact of contrastive explanations about large errors on users' attitudes towards a "black-box" model. We propose an algorithm, Monte Carlo Bounds for Reasonable Predictions (MC-BRP). Given a large error, MC-BRP determines (1) feature values that would result in a reasonable prediction, and (2) general trends between each feature and the target, both based on Monte Carlo simulations. We evaluate on a real dataset with real users by conducting a user study with 75 participants to determine whether explanations generated by MC-BRP help users understand why a prediction results in a large error, and whether this promotes trust in an automatically learned model. Our study shows that users are able to answer objective questions about the model's predictions with an overall accuracy of 81.1% when provided with these contrastive explanations. We show that users who saw MC-BRP explanations understand why the model makes large errors in predictions significantly better than users in the control group. We also conduct an in-depth analysis of the difference in attitudes between Practitioners and Researchers, and confirm that our results hold when conditioning on the users' background. Comment: To appear in ACM FAT* 2020.
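    To make the Monte Carlo idea concrete, the sketch below is a simplified illustration, not the published MC-BRP algorithm: it perturbs an instance that received a large error and keeps the perturbations whose predictions fall within a tolerance band around the observed value. The toy retail-style data, the tolerance threshold, and the random forest model are all assumptions made for the example.

```python
# Simplified illustration (not MC-BRP itself): Monte Carlo search for feature
# ranges under which a fitted model would have produced a "reasonable" prediction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy sales-like data: the target depends on price and promotion intensity.
X = rng.uniform(0, 1, size=(500, 2))
y = 100 - 60 * X[:, 0] + 40 * X[:, 1] + rng.normal(0, 5, 500)
model = RandomForestRegressor(random_state=0).fit(X, y)

x_err = np.array([0.9, 0.1])   # instance on which the model erred badly
y_true = 120.0                 # observed value, far from the model's prediction
tolerance = 10.0               # what counts as a "reasonable" prediction

# Sample perturbed feature vectors and keep those whose prediction lands
# within the tolerance band around the observed value.
samples = rng.uniform(0, 1, size=(5000, 2))
preds = model.predict(samples)
reasonable = samples[np.abs(preds - y_true) <= tolerance]

if len(reasonable):
    lo, hi = reasonable.min(axis=0), reasonable.max(axis=0)
    print("feature ranges that yield a reasonable prediction:")
    for j, name in enumerate(["price", "promotion"]):
        print(f"  {name}: {lo[j]:.2f} .. {hi[j]:.2f}")
```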

    'It's Reducing a Human Being to a Percentage': Perceptions of Justice in Algorithmic Decisions

    Data-driven decision-making consequential to individuals raises important questions of accountability and justice. Indeed, European law provides individuals limited rights to 'meaningful information about the logic' behind significant, autonomous decisions such as loan approvals, insurance quotes, and CV filtering. We undertake three experimental studies examining people's perceptions of justice in algorithmic decision-making under different scenarios and explanation styles. Dimensions of justice previously observed in response to human decision-making appear similarly engaged in response to algorithmic decisions. Qualitative analysis identified several concerns and heuristics involved in justice perceptions, including arbitrariness, generalisation, and (in)dignity. Quantitative analysis indicates that explanation styles primarily matter to justice perceptions only when subjects are exposed to multiple different styles; under repeated exposure to one style, scenario effects obscure any explanation effects. Our results suggest there may be no 'best' approach to explaining algorithmic decisions, and that reflection on their automated nature both implicates and mitigates justice dimensions. Comment: 14 pages, 3 figures, ACM Conference on Human Factors in Computing Systems (CHI'18), April 21-26, Montreal, Canada.

    Development of the reflux finding score for infants and its observer agreement

    It is hypothesized that laryngeal edema is caused by laryngopharyngeal reflux (LPR), i.e., gastroesophageal reflux extending into the larynx and pharynx. The validated reflux finding score (RFS) assesses LPR disease in adults. We therefore aimed to develop an adapted RFS for infants (RFS-I) and assess its observer agreement. Visibility of laryngeal anatomic landmarks was assessed by determining observer agreement. The RFS-I was developed based on the RFS, the observer agreement found, and expert opinion. An educational tutorial was developed and presented to 3 pediatric otorhinolaryngologists, 2 otorhinolaryngologists, and 2 gastroenterology fellows, who then scored videos of flexible laryngoscopy procedures of infants who were either diagnosed with or specifically without laryngeal edema. In total, 52 infants were included, with a median age of 19.5 (0-70) weeks; 12 and 40 infants, respectively, were included for the assessment of the laryngeal anatomic landmarks and the assessment of the RFS-I. Overall interobserver agreement of the RFS-I was moderate (intraclass correlation coefficient = 0.45). Intraobserver agreement ranged from moderate to excellent (intraclass correlation coefficient = 0.50-0.87). A standardized scoring instrument was thus developed for the diagnosis of LPR disease using flexible laryngoscopy. Using this tool, only moderate interobserver agreement was reached, with highly variable intraobserver agreement. Because a valid scoring system for flexible laryngoscopy is still lacking, the RFS-I and flexible laryngoscopy should not be used on their own to clinically assess LPR-related findings of the larynx, nor to guide treatment.
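    The interobserver agreement reported here is expressed as an intraclass correlation coefficient (ICC). The sketch below shows one common way to compute an ICC from a long-format ratings table; it assumes the pingouin library, and the observer labels and RFS-I scores are invented for illustration, not data from the study.

```python
# Minimal sketch: intraclass correlation coefficient for interobserver
# agreement, assuming the `pingouin` library. Scores below are invented.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "video":    [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "observer": ["A", "B", "C"] * 4,
    "rfs_i":    [6, 7, 5, 2, 3, 2, 9, 8, 10, 4, 5, 4],
})

icc = pg.intraclass_corr(data=ratings, targets="video",
                         raters="observer", ratings="rfs_i")
# ICC2 (two-way random effects, absolute agreement, single rater) is a common
# choice when each video is scored by every observer.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```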

    IGLU 2022: Interactive Grounded Language Understanding in a Collaborative Environment at NeurIPS 2022

    Human intelligence has the remarkable ability to adapt to new tasks and environments quickly. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research in this direction, we propose IGLU: Interactive Grounded Language Understanding in a Collaborative Environment. The primary goal of the competition is to approach the problem of how to develop interactive embodied agents that learn to solve a task while being provided with grounded natural language instructions in a collaborative environment. Understanding the complexity of the challenge, we split it into sub-tasks to make it feasible for participants. This research challenge is naturally related, but not limited, to two fields of study that are highly relevant to the NeurIPS community: Natural Language Understanding and Generation (NLU/G) and Reinforcement Learning (RL). Therefore, the suggested challenge can bring the two communities together to approach one of the crucial challenges in AI. Another critical aspect of the challenge is the dedication to performing a human-in-the-loop evaluation as the final evaluation for the agents developed by contestants. Comment: arXiv admin note: text overlap with arXiv:2110.0653

    Reliability of the reflux finding score for infants in flexible versus rigid laryngoscopy

    Objectives: The Reflux Finding Score for Infants (RFS-I) was developed to assess signs of laryngopharyngeal reflux (LPR) in infants. With flexible laryngoscopy, moderate interobserver and highly variable intraobserver reliability was found. We hypothesized that the use of rigid laryngoscopy would increase reliability and therefore evaluated the reliability of the RFS-I for flexible versus rigid laryngoscopy in infants. Methods: We established a set of videos of consecutively performed flexible and rigid laryngoscopies in infants. The RFS-I was scored twice by 4 otorhinolaryngologists, 2 otorhinolaryngology fellows, and 2 inexperienced observers. Cohen's and Fleiss' kappas (k) were calculated for categorical data, and the intraclass correlation coefficient (ICC) was calculated for ordinal data. Results: The study set consisted of laryngoscopic videos of 30 infants (median age 7.5 (0-19.8) months). Overall interobserver reliability of the RFS-I was moderate for both flexible (ICC = 0.60, 95% CI 0.44-0.76) and rigid (ICC = 0.42, 95% CI 0.26-0.62) laryngoscopy. There were no significant differences in reliability of overall RFS-I scores or individual RFS-I items for flexible versus rigid laryngoscopy. Intraobserver reliability of the total RFS-I score ranged from fair to excellent for both flexible (ICC = 0.33-0.93) and rigid (ICC = 0.39-0.86) laryngoscopy. Comparing RFS-I results for flexible versus rigid laryngoscopy per observer, reliability ranged from none to substantial (k = -0.16 to 0.63, mean k = 0.22), with an observed agreement of 0.08-0.35. Conclusion: Reliability of the RFS-I was moderate and did not differ between flexible and rigid laryngoscopy. The RFS-I is not suitable for detecting signs of LPR in infants or for guiding treatment, with either flexible or rigid laryngoscopy.
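    The agreement statistics named in the Methods (Cohen's kappa for two observers, Fleiss' kappa for several) can be computed with standard libraries. The sketch below is a minimal illustration assuming scikit-learn and statsmodels; the binary ratings (0 = sign absent, 1 = sign present) are invented and are not data from the study.

```python
# Minimal sketch of the agreement statistics mentioned above, on invented data.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Cohen's kappa: agreement between two observers on the same videos.
obs_a = [1, 0, 1, 1, 0, 0, 1, 0]
obs_b = [1, 0, 1, 0, 0, 0, 1, 1]
print("Cohen's kappa:", cohen_kappa_score(obs_a, obs_b))

# Fleiss' kappa: agreement among several observers. Rows are subjects
# (videos), columns are observers; ratings are converted to per-category counts.
all_obs = np.array([
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
])
counts, _ = aggregate_raters(all_obs)  # subjects x categories count table
print("Fleiss' kappa:", fleiss_kappa(counts))
```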