Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?
Learning algorithms become more powerful, often at the cost of increased complexity. In response, the demand for transparent algorithms is growing. In NLP tasks, attention distributions learned by attention-based deep learning models are used to gain insight into the models' behavior. To what extent is this perspective valid for all NLP tasks? We investigate whether distributions calculated by different attention heads in a Transformer architecture can be used to improve transparency in the task of abstractive summarization. To this end, we present both a qualitative and a quantitative analysis of the behavior of the attention heads. We show that some attention heads indeed specialize towards syntactically and semantically distinct input. We propose an approach to evaluate to what extent the Transformer model relies on specifically learned attention distributions. We also discuss what this implies for using attention distributions as a means of transparency.
Comment: To appear at FACTS-IR 2019, SIGIR
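The per-head attention distributions the abstract inspects can be sketched in a few lines. This is a minimal NumPy illustration of scaled dot-product attention split across heads, not the paper's summarization model; the projection matrices `W_q`, `W_k` and all sizes are illustrative assumptions.

```python
import numpy as np

def attention_distributions(X, W_q, W_k, num_heads):
    """Per-head attention distributions (softmax of scaled dot products).

    X: (seq_len, d_model) token embeddings; W_q, W_k: (d_model, d_model)
    projections. Returns (num_heads, seq_len, seq_len), where row i of head h
    is token i's attention distribution over all input tokens.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project, then split the feature dimension into heads: (h, seq, d_head)
    Q = (X @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))
A = attention_distributions(X, W_q, W_k, num_heads=2)
print(A.shape)  # (2, 5, 5): one seq x seq distribution matrix per head
```

Each row of each head's matrix sums to 1, which is what makes these distributions usable as (putative) importance weights over the input.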
Why Does My Model Fail? Contrastive Local Explanations for Retail Forecasting
In various business settings, there is an interest in using more complex machine learning techniques for sales forecasting. It is difficult to convince analysts, along with their superiors, to adopt these techniques, since the models are considered to be "black boxes," even if they perform better than the models currently in use. We examine the impact of contrastive explanations about large errors on users' attitudes towards a "black-box" model. We propose an algorithm, Monte Carlo Bounds for Reasonable Predictions (MC-BRP). Given a large error, MC-BRP determines (1) feature values that would result in a reasonable prediction, and (2) general trends between each feature and the target, both based on Monte Carlo simulations. We evaluate on a real dataset with real users by conducting a user study with 75 participants to determine whether explanations generated by MC-BRP help users understand why a prediction results in a large error, and whether this promotes trust in an automatically learned model. Our study shows that users are able to answer objective questions about the model's predictions with 81.1% overall accuracy when provided with these contrastive explanations. Users who saw MC-BRP explanations understood why the model makes large errors significantly better than users in the control group. We also conduct an in-depth analysis of the difference in attitudes between Practitioners and Researchers, and confirm that our results hold when conditioning on the users' background.
Comment: To appear in ACM FAT* 2020
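The abstract does not spell out MC-BRP's internals, but its two outputs can be illustrated with a hedged sketch: draw Monte Carlo samples around the mispredicted instance, keep those whose prediction falls inside a "reasonable" interval to bound each feature, and read each feature's trend off the sign of its correlation with the prediction. The function name, feature ranges, and toy model below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def mc_brp_sketch(model_predict, feature_ranges, reasonable_interval,
                  n_samples=5000, seed=0):
    """For one instance with a large error, sample feature vectors and keep
    those whose prediction is 'reasonable'. Returns per-feature value bounds
    (output 1) and the sign of each feature's trend with the target (output 2).
    """
    rng = np.random.default_rng(seed)
    lo_band, hi_band = reasonable_interval
    d = len(feature_ranges)
    # Monte Carlo draws, one column per feature: (n_samples, d)
    samples = np.array([rng.uniform(lo, hi, size=n_samples)
                        for lo, hi in feature_ranges]).T
    preds = model_predict(samples)
    keep = (preds >= lo_band) & (preds <= hi_band)
    bounds = [(samples[keep, j].min(), samples[keep, j].max())
              for j in range(d)]
    trends = [float(np.sign(np.corrcoef(samples[:, j], preds)[0, 1]))
              for j in range(d)]
    return bounds, trends

# Toy forecaster: the target rises with feature 0 and falls with feature 1
model = lambda S: 10 + 3 * S[:, 0] - 2 * S[:, 1]
bounds, trends = mc_brp_sketch(model, [(0, 5), (0, 5)], (10, 20))
print(trends)  # [1.0, -1.0]: increasing trend for feature 0, decreasing for 1
```

An explanation shown to a user would then pair each feature's bound ("a reasonable prediction requires feature 0 between a and b") with its trend direction.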
'It's Reducing a Human Being to a Percentage': Perceptions of Justice in Algorithmic Decisions
Data-driven decision-making consequential to individuals raises important
questions of accountability and justice. Indeed, European law provides
individuals limited rights to 'meaningful information about the logic' behind
significant, autonomous decisions such as loan approvals, insurance quotes, and
CV filtering. We undertake three experimental studies examining people's
perceptions of justice in algorithmic decision-making under different scenarios
and explanation styles. Dimensions of justice previously observed in response
to human decision-making appear similarly engaged in response to algorithmic
decisions. Qualitative analysis identified several concerns and heuristics
involved in justice perceptions including arbitrariness, generalisation, and
(in)dignity. Quantitative analysis indicates that explanation styles primarily matter to justice perceptions only when subjects are exposed to multiple different styles; under repeated exposure to one style, scenario effects obscure any explanation effects. Our results suggest there may be no 'best' approach to explaining algorithmic decisions, and that reflection on their automated nature both implicates and mitigates justice dimensions.
Comment: 14 pages, 3 figures, ACM Conference on Human Factors in Computing Systems (CHI'18), April 21--26, Montreal, Canada
Development of the reflux finding score for infants and its observer agreement
It is hypothesized that laryngeal edema is caused by laryngopharyngeal reflux (LPR), i.e., gastroesophageal reflux extending into the larynx and pharynx. The validated reflux finding score (RFS) assesses LPR disease in adults. We therefore aimed to develop an adapted RFS for infants (RFS-I) and to assess its observer agreement. Visibility of laryngeal anatomic landmarks was assessed by determining observer agreement. The RFS-I was developed based on the RFS, the observed agreement, and expert opinion. An educational tutorial was developed and presented to 3 pediatric otorhinolaryngologists, 2 otorhinolaryngologists, and 2 gastroenterology fellows, who then scored videos of flexible laryngoscopy procedures of infants who were either diagnosed with or specifically without laryngeal edema. In total, 52 infants were included, with a median age of 19.5 (0-70) weeks: 12 and 40 infants, respectively, for the assessment of the laryngeal anatomic landmarks and the assessment of the RFS-I. Overall interobserver agreement of the RFS-I was moderate (intraclass correlation coefficient = 0.45), and intraobserver agreement ranged from moderate to excellent (intraclass correlation coefficient = 0.50-0.87). A standardized scoring instrument was thus developed for the diagnosis of LPR disease using flexible laryngoscopy. Using this tool, only moderate interobserver agreement was reached, with highly variable intraobserver agreement. Because a validated scoring system for flexible laryngoscopy is still lacking, the RFS-I and flexible laryngoscopy should not be used on their own to clinically assess LPR-related findings of the larynx, nor to guide treatment.
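The intraclass correlation coefficient reported above comes in several variants, and the abstract does not state which was used; as a hedged sketch, one common two-way form, ICC(2,1) for absolute agreement between single raters, can be computed as follows. The example scores are made up.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_subjects, k_raters) array of scores.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)          # per-subject means
    col_means = ratings.mean(axis=0)          # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between raters
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two observers scoring five hypothetical laryngoscopy videos (0-4 scale)
scores = [[2, 2], [0, 1], [3, 3], [1, 2], [4, 3]]
print(round(icc_2_1(scores), 2))  # → 0.81
```

With identical columns (perfect agreement) the statistic equals 1; values around 0.45, as reported for the RFS-I, indicate only moderate agreement.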
IGLU 2022: Interactive Grounded Language Understanding in a Collaborative Environment at NeurIPS 2022
Human intelligence has the remarkable ability to adapt to new tasks and
environments quickly. Starting from a very young age, humans acquire new skills
and learn how to solve new tasks either by imitating the behavior of others or
by following provided natural language instructions. To facilitate research in
this direction, we propose IGLU: Interactive Grounded Language Understanding in
a Collaborative Environment. The primary goal of the competition is to approach
the problem of how to develop interactive embodied agents that learn to solve a
task while provided with grounded natural language instructions in a
collaborative environment. Understanding the complexity of the challenge, we
split it into sub-tasks to make it feasible for participants.
This research challenge is naturally related, but not limited, to two fields
of study that are highly relevant to the NeurIPS community: Natural Language
Understanding and Generation (NLU/G) and Reinforcement Learning (RL).
Therefore, the suggested challenge can bring two communities together to
approach one of the crucial challenges in AI. Another critical aspect of the
challenge is the dedication to perform a human-in-the-loop evaluation as a
final evaluation for the agents developed by contestants.
Comment: arXiv admin note: text overlap with arXiv:2110.0653
Reliability of the reflux finding score for infants in flexible versus rigid laryngoscopy
Objectives: The Reflux Finding Score for Infants (RFS-I) was developed to assess signs of laryngopharyngeal reflux (LPR) in infants. With flexible laryngoscopy, moderate interobserver and highly variable intraobserver reliability was found. We hypothesized that the use of rigid laryngoscopy would increase reliability, and therefore evaluated the reliability of the RFS-I for flexible versus rigid laryngoscopy in infants. Methods: We established a set of videos of consecutively performed flexible and rigid laryngoscopies in infants. The RFS-I was scored twice by 4 otorhinolaryngologists, 2 otorhinolaryngology fellows, and 2 inexperienced observers. Cohen's and Fleiss' kappas (k) were calculated for categorical data, and the intraclass correlation coefficient (ICC) was calculated for ordinal data. Results: The study set consisted of laryngoscopic videos of 30 infants (median age 7.5 (0-19.8) months). Overall interobserver reliability of the RFS-I was moderate for both flexible (ICC = 0.60, 95% CI 0.44-0.76) and rigid (ICC = 0.42, 95% CI 0.26-0.62) laryngoscopy. There were no significant differences in reliability of overall RFS-I scores or individual RFS-I items for flexible versus rigid laryngoscopy. Intraobserver reliability of the total RFS-I score ranged from fair to excellent for both flexible (ICC = 0.33-0.93) and rigid (ICC = 0.39-0.86) laryngoscopy. Comparing RFS-I results for flexible versus rigid laryngoscopy per observer, reliability ranged from none to substantial (k = -0.16 to 0.63, mean k = 0.22), with an observed agreement of 0.08-0.35. Conclusion: Reliability of the RFS-I was moderate and did not differ between flexible and rigid laryngoscopy. The RFS-I is not suitable to detect signs of LPR in infants or to guide treatment, with either flexible or rigid laryngoscopy.
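Cohen's kappa, used above for the categorical RFS-I items, corrects raw observed agreement for the agreement two raters would reach by chance. A minimal sketch (not the authors' analysis code, with made-up labels):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels of the same items."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same category
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in ca.keys() | cb.keys())
    return (p_o - p_e) / (1 - p_e)

# Two observers rating the same 8 videos for presence of one RFS-I item
obs1 = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
obs2 = ["yes", "yes", "no", "yes", "yes", "no", "no", "no"]
print(round(cohens_kappa(obs1, obs2), 2))  # → 0.5
```

Kappa is 1 for perfect agreement and 0 for chance-level agreement; negative values such as the k = -0.16 reported above mean the raters agreed less often than chance would predict.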