Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?
Learning algorithms become more powerful, often at the cost of increased complexity. In response, the demand for transparent algorithms is growing. In NLP tasks, attention distributions learned by attention-based deep learning models are used to gain insight into the models' behavior. To what extent is this perspective valid for all NLP tasks? We investigate whether distributions calculated by different attention heads in a Transformer architecture can be used to improve transparency in the task of abstractive summarization. To this end, we present both a qualitative and a quantitative analysis of the behavior of the attention heads. We show that some attention heads indeed specialize towards syntactically and semantically distinct input. We propose an approach to evaluate to what extent the Transformer model relies on specifically learned attention distributions. We also discuss what this implies for using attention distributions as a means of transparency.
Comment: To appear at FACTS-IR 2019, SIGIR
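The per-head attention distributions the abstract inspects can be sketched in a few lines. This is a minimal NumPy illustration of scaled dot-product attention split across heads, not the paper's summarization model; the projection matrices `W_q`, `W_k` and all sizes are illustrative assumptions.

```python
import numpy as np

def attention_distributions(X, W_q, W_k, num_heads):
    """Per-head attention distributions (softmax of scaled dot products).

    X: (seq_len, d_model) token embeddings; W_q, W_k: (d_model, d_model)
    projections. Returns (num_heads, seq_len, seq_len), where row i of head h
    is token i's attention distribution over all input tokens.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project, then split the feature dimension into heads: (h, seq, d_head)
    Q = (X @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))
A = attention_distributions(X, W_q, W_k, num_heads=2)
print(A.shape)  # (2, 5, 5): one seq x seq distribution matrix per head
```

Each row of each head's matrix sums to 1, which is what makes these distributions usable as (putative) importance weights over the input.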
Why Does My Model Fail? Contrastive Local Explanations for Retail Forecasting
In various business settings, there is an interest in using more complex machine learning techniques for sales forecasting. It is difficult to convince analysts, along with their superiors, to adopt these techniques, since the models are considered to be "black boxes," even if they perform better than the models currently in use. We examine the impact of contrastive explanations about large errors on users' attitudes towards a "black-box" model. We propose an algorithm, Monte Carlo Bounds for Reasonable Predictions (MC-BRP). Given a large error, MC-BRP determines (1) feature values that would result in a reasonable prediction, and (2) general trends between each feature and the target, both based on Monte Carlo simulations. We evaluate on a real dataset with real users by conducting a user study with 75 participants to determine whether explanations generated by MC-BRP help users understand why a prediction results in a large error, and whether this promotes trust in an automatically learned model. Our study shows that users are able to answer objective questions about the model's predictions with 81.1% overall accuracy when provided with these contrastive explanations. Users who saw MC-BRP explanations understood why the model makes large errors significantly better than users in the control group. We also conduct an in-depth analysis of the difference in attitudes between Practitioners and Researchers, and confirm that our results hold when conditioning on the users' background.
Comment: To appear in ACM FAT* 2020
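The abstract does not spell out MC-BRP's internals, but its two outputs can be illustrated with a hedged sketch: draw Monte Carlo samples around the mispredicted instance, keep those whose prediction falls inside a "reasonable" interval to bound each feature, and read each feature's trend off the sign of its correlation with the prediction. The function name, feature ranges, and toy model below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def mc_brp_sketch(model_predict, feature_ranges, reasonable_interval,
                  n_samples=5000, seed=0):
    """For one instance with a large error, sample feature vectors and keep
    those whose prediction is 'reasonable'. Returns per-feature value bounds
    (output 1) and the sign of each feature's trend with the target (output 2).
    """
    rng = np.random.default_rng(seed)
    lo_band, hi_band = reasonable_interval
    d = len(feature_ranges)
    # Monte Carlo draws, one column per feature: (n_samples, d)
    samples = np.array([rng.uniform(lo, hi, size=n_samples)
                        for lo, hi in feature_ranges]).T
    preds = model_predict(samples)
    keep = (preds >= lo_band) & (preds <= hi_band)
    bounds = [(samples[keep, j].min(), samples[keep, j].max())
              for j in range(d)]
    trends = [float(np.sign(np.corrcoef(samples[:, j], preds)[0, 1]))
              for j in range(d)]
    return bounds, trends

# Toy forecaster: the target rises with feature 0 and falls with feature 1
model = lambda S: 10 + 3 * S[:, 0] - 2 * S[:, 1]
bounds, trends = mc_brp_sketch(model, [(0, 5), (0, 5)], (10, 20))
print(trends)  # [1.0, -1.0]: increasing trend for feature 0, decreasing for 1
```

An explanation shown to a user would then pair each feature's bound ("a reasonable prediction requires feature 0 between a and b") with its trend direction.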
'It's Reducing a Human Being to a Percentage': Perceptions of Justice in Algorithmic Decisions
Data-driven decision-making consequential to individuals raises important
questions of accountability and justice. Indeed, European law provides
individuals limited rights to 'meaningful information about the logic' behind
significant, autonomous decisions such as loan approvals, insurance quotes, and
CV filtering. We undertake three experimental studies examining people's
perceptions of justice in algorithmic decision-making under different scenarios
and explanation styles. Dimensions of justice previously observed in response
to human decision-making appear similarly engaged in response to algorithmic
decisions. Qualitative analysis identified several concerns and heuristics
involved in justice perceptions including arbitrariness, generalisation, and
(in)dignity. Quantitative analysis indicates that explanation styles primarily matter to justice perceptions only when subjects are exposed to multiple different styles; under repeated exposure to one style, scenario effects obscure any explanation effects. Our results suggest there may be no 'best' approach to explaining algorithmic decisions, and that reflection on their automated nature both implicates and mitigates justice dimensions.
Comment: 14 pages, 3 figures, ACM Conference on Human Factors in Computing Systems (CHI'18), April 21--26, Montreal, Canada
Development of the reflux finding score for infants and its observer agreement
It is hypothesized that laryngeal edema is caused by laryngopharyngeal reflux (LPR), i.e., gastroesophageal reflux extending into the larynx and pharynx. The validated reflux finding score (RFS) assesses LPR disease in adults. We therefore aimed to develop an adapted RFS for infants (RFS-I) and to assess its observer agreement. Visibility of laryngeal anatomic landmarks was assessed by determining observer agreement. The RFS-I was developed based on the RFS, the observed agreement, and expert opinion. An educational tutorial was developed and presented to 3 pediatric otorhinolaryngologists, 2 otorhinolaryngologists, and 2 gastroenterology fellows, who then scored videos of flexible laryngoscopy procedures of infants who were either diagnosed with or specifically without laryngeal edema. In total, 52 infants were included, with a median age of 19.5 (0-70) weeks: 12 and 40 infants, respectively, for the assessment of the laryngeal anatomic landmarks and the assessment of the RFS-I. Overall interobserver agreement of the RFS-I was moderate (intraclass correlation coefficient = 0.45), and intraobserver agreement ranged from moderate to excellent (intraclass correlation coefficient = 0.50-0.87). A standardized scoring instrument was thus developed for the diagnosis of LPR disease using flexible laryngoscopy. Using this tool, only moderate interobserver agreement was reached, with highly variable intraobserver agreement. Because a validated scoring system for flexible laryngoscopy is still lacking, the RFS-I and flexible laryngoscopy should not be used on their own to clinically assess LPR-related findings of the larynx, nor to guide treatment.
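The intraclass correlation coefficient reported above comes in several variants, and the abstract does not state which was used; as a hedged sketch, one common two-way form, ICC(2,1) for absolute agreement between single raters, can be computed as follows. The example scores are made up.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_subjects, k_raters) array of scores.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)          # per-subject means
    col_means = ratings.mean(axis=0)          # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between raters
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two observers scoring five hypothetical laryngoscopy videos (0-4 scale)
scores = [[2, 2], [0, 1], [3, 3], [1, 2], [4, 3]]
print(round(icc_2_1(scores), 2))  # → 0.81
```

With identical columns (perfect agreement) the statistic equals 1; values around 0.45, as reported for the RFS-I, indicate only moderate agreement.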
IGLU 2022: Interactive Grounded Language Understanding in a Collaborative Environment at NeurIPS 2022
Human intelligence has the remarkable ability to adapt to new tasks and
environments quickly. Starting from a very young age, humans acquire new skills
and learn how to solve new tasks either by imitating the behavior of others or
by following provided natural language instructions. To facilitate research in
this direction, we propose IGLU: Interactive Grounded Language Understanding in
a Collaborative Environment. The primary goal of the competition is to approach
the problem of how to develop interactive embodied agents that learn to solve a
task while provided with grounded natural language instructions in a
collaborative environment. Understanding the complexity of the challenge, we
split it into sub-tasks to make it feasible for participants.
This research challenge is naturally related, but not limited, to two fields
of study that are highly relevant to the NeurIPS community: Natural Language
Understanding and Generation (NLU/G) and Reinforcement Learning (RL).
Therefore, the suggested challenge can bring two communities together to
approach one of the crucial challenges in AI. Another critical aspect of the
challenge is the dedication to perform a human-in-the-loop evaluation as a
final evaluation for the agents developed by contestants.
Comment: arXiv admin note: text overlap with arXiv:2110.0653
Reliability of the reflux finding score for infants in flexible versus rigid laryngoscopy
Objectives: The Reflux Finding Score for Infants (RFS-I) was developed to assess signs of laryngopharyngeal reflux (LPR) in infants. With flexible laryngoscopy, moderate interobserver and highly variable intraobserver reliability was found. We hypothesized that the use of rigid laryngoscopy would increase reliability, and therefore evaluated the reliability of the RFS-I for flexible versus rigid laryngoscopy in infants. Methods: We established a set of videos of consecutively performed flexible and rigid laryngoscopies in infants. The RFS-I was scored twice by 4 otorhinolaryngologists, 2 otorhinolaryngology fellows, and 2 inexperienced observers. Cohen's and Fleiss' kappas (k) were calculated for categorical data, and the intraclass correlation coefficient (ICC) was calculated for ordinal data. Results: The study set consisted of laryngoscopic videos of 30 infants (median age 7.5 (0-19.8) months). Overall interobserver reliability of the RFS-I was moderate for both flexible (ICC = 0.60, 95% CI 0.44-0.76) and rigid (ICC = 0.42, 95% CI 0.26-0.62) laryngoscopy. There were no significant differences in reliability of overall RFS-I scores or individual RFS-I items for flexible versus rigid laryngoscopy. Intraobserver reliability of the total RFS-I score ranged from fair to excellent for both flexible (ICC = 0.33-0.93) and rigid (ICC = 0.39-0.86) laryngoscopy. Comparing RFS-I results for flexible versus rigid laryngoscopy per observer, reliability ranged from none to substantial (k = -0.16 to 0.63, mean k = 0.22), with an observed agreement of 0.08-0.35. Conclusion: Reliability of the RFS-I was moderate and did not differ between flexible and rigid laryngoscopy. The RFS-I is not suitable to detect signs of LPR in infants or to guide treatment, with either flexible or rigid laryngoscopy.
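Cohen's kappa, used above for the categorical RFS-I items, corrects raw observed agreement for the agreement two raters would reach by chance. A minimal sketch (not the authors' analysis code, with made-up labels):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels of the same items."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same category
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in ca.keys() | cb.keys())
    return (p_o - p_e) / (1 - p_e)

# Two observers rating the same 8 videos for presence of one RFS-I item
obs1 = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
obs2 = ["yes", "yes", "no", "yes", "yes", "no", "no", "no"]
print(round(cohens_kappa(obs1, obs2), 2))  # → 0.5
```

Kappa is 1 for perfect agreement and 0 for chance-level agreement; negative values such as the k = -0.16 reported above mean the raters agreed less often than chance would predict.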