Learning Latent Characteristics of Data and Models using Item Response Theory
A supervised machine learning model is trained with a large set of labeled training data, and evaluated on a smaller but still large set of test data. Especially with deep neural networks (DNNs), the complexity of the model requires that an extremely large data set is collected to prevent overfitting. It is often the case that these models do not take into account specific attributes of the training set examples, but instead treat each equally in the process of model training. This is because it is difficult to model latent traits of individual examples at the scale of hundreds of thousands or millions of data points. However, there exists a set of psychometric methods that can model attributes of specific examples and can greatly improve model training and evaluation in the supervised learning process.
Item Response Theory (IRT) is a well-studied psychometric methodology for scale construction and evaluation. IRT jointly models human ability and example characteristics such as difficulty based on human response data. We introduce new evaluation metrics for both humans and machine learning models built using IRT, and propose new methods for applying IRT to machine learning-scale data.
We use IRT to make contributions to the machine learning community in the following areas: (i) new test sets for evaluating machine learning models with respect to a human population, (ii) new insights about how deep-learning models learn by tracking example difficulty and training conditions, (iii) new methods for data selection and curriculum building to improve model training efficiency, and (iv) a new test of electronic health literacy built with questions extracted from de-identified patient Electronic Health Records (EHRs).
We first introduce two new evaluation sets built and validated using IRT. These tests are the first IRT test sets to be applied to natural language processing tasks. Using IRT test sets allows for more comprehensive comparison of NLP models. Second, by modeling the difficulty of test set examples, we identify patterns that emerge when training deep neural network models that are consistent with human learning patterns. Specifically, as models are trained with larger training sets, they learn easy test set examples more quickly than hard examples. Third, we present a method for using soft labels on a subset of training data to improve deep learning model generalization. We show that fine-tuning a trained deep neural network with as little as 0.1% of the training data can improve model generalization in terms of test set accuracy. Fourth, we propose a new method for estimating IRT example and model parameters that allows for learning parameters at a much larger scale than previously available to accommodate the large data sets required for deep learning. This allows for learning IRT models at machine learning scale, with hundreds of thousands of examples and large ensembles of machine learning models. The response patterns of machine learning models can be used to learn IRT example characteristics instead of human response patterns. Fifth, we introduce a dynamic curriculum learning process that estimates model competency during training to adaptively select training data that is appropriate for learning at the given epoch. Finally, we introduce the ComprehENotes test, the first test of EHR comprehension for humans. The test is an accurate measure for identifying individuals with low EHR note comprehension ability, and validates the effectiveness of previously self-reported patient comprehension evaluations.
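To make the idea of jointly modeling ability and example difficulty concrete, the following is a minimal sketch of the standard two-parameter logistic (2PL) IRT response function, together with a grid-search ability estimate. It is illustrative only: the abstract does not say which IRT variant or estimation procedure the work uses, and the function names, default discrimination of 1.0, and ability grid are assumptions made for this example.

```python
import math


def irt_2pl(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Probability of a correct response under a 2PL IRT model.

    theta: latent ability of the respondent (human or model).
    difficulty: latent difficulty of the example.
    discrimination: slope of the item curve (1.0 reduces this to the 1PL form).
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def estimate_ability(responses, difficulties, grid=None):
    """Maximum-likelihood ability estimate by grid search (hypothetical helper).

    responses: list of 0/1 outcomes, one per example.
    difficulties: known difficulty for each example, same order as responses.
    """
    if grid is None:
        # Assumed search range of -4.0 to 4.0 in steps of 0.1.
        grid = [g / 10.0 for g in range(-40, 41)]

    def log_likelihood(theta):
        total = 0.0
        for y, b in zip(responses, difficulties):
            p = irt_2pl(theta, b)
            total += math.log(p) if y == 1 else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)


if __name__ == "__main__":
    item_difficulties = [-1.0, 0.0, 1.0, 2.0]
    print(estimate_ability([1, 1, 1, 0], item_difficulties))
    print(estimate_ability([1, 0, 0, 0], item_difficulties))
```

A respondent who answers more of the same items correctly receives a higher estimated ability, and for a fixed ability the predicted probability of success falls as difficulty rises.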
H-COAL: Human Correction of AI-Generated Labels for Biomedical Named Entity Recognition
With the rapid advancement of machine learning models for NLP tasks,
collecting high-fidelity labels from AI models is a realistic possibility.
Firms now make AI available to customers via predictions as a service (PaaS).
This includes PaaS products for healthcare. It is unclear whether these labels
can be used for training a local model without expensive annotation checking by
in-house experts. In this work, we propose a new framework for Human Correction
of AI-Generated Labels (H-COAL). By ranking AI-generated outputs, one can
selectively correct labels and approach gold standard performance (100% human
labeling) with significantly less human effort. We show that correcting 5% of
labels can close the AI-human performance gap by up to 64% relative
improvement, and correcting 20% of labels can close the performance gap by up
to 86% relative improvement.
Comment: Presented at Conference on Information Systems and Technology (CIST) 202
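The selective-correction idea and the "gap closed" figures in the H-COAL abstract can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not specify the ranking criterion, so the per-label confidence score used here is an assumption, and the function names and the example scores are hypothetical.

```python
def select_for_correction(confidences, budget_fraction):
    """Return indices of the labels to send to human experts.

    Assumption: labels are ranked by a per-label confidence score and the
    least confident ones are corrected first.
    """
    n_to_correct = int(len(confidences) * budget_fraction)
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:n_to_correct]


def apply_corrections(ai_labels, human_labels, indices):
    """Replace the AI label with the human label at each selected index."""
    corrected = list(ai_labels)
    for i in indices:
        corrected[i] = human_labels[i]
    return corrected


def gap_closed(ai_score, corrected_score, human_score):
    """Fraction of the AI-human performance gap recovered after correction."""
    return (corrected_score - ai_score) / (human_score - ai_score)


if __name__ == "__main__":
    confidences = [0.9, 0.2, 0.5, 0.7, 0.1]
    chosen = select_for_correction(confidences, 0.4)
    print(chosen)
    print(apply_corrections(["O", "O", "B", "O", "O"], ["O", "B", "B", "O", "B"], chosen))
    # Hypothetical scores chosen only to show the arithmetic.
    print(gap_closed(ai_score=0.80, corrected_score=0.864, human_score=0.90))
```

Under these made-up scores, a model trained on partially corrected labels that moves from 0.80 to 0.864 against a fully human-labeled score of 0.90 has recovered 64% of the gap.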
Sexuality and Social Justice: What's Law Got to Do with It? International Symposium Workshop Report
In March 2015, the Sexuality, Poverty and Law programme at the Institute of Development Studies brought together over 60 activists, lawyers, researchers and international advocates to critically assess the scope of law and legal activism for achieving social justice for those marginalised because of their sexual or gender non-conformity. Delegates represented a broad range of expertise in the field of sexuality, gender identity, rights and social justice. They included a number of leading lawyers and activists involved in litigating cases of sexual and gender rights in countries such as Uganda, Malaysia, the United Kingdom, Argentina and Botswana. Lawyers and activists shared their experiences of working within this fast developing area of domestic and international law. Discussions also addressed the wider social and theoretical aspects of recent legal developments, contributing to our understanding of the complex relationship between research, knowledge exchange, activism and law. UK Department for International Development
Growing grass for a green biorefinery - an option for Ireland?
Growing grass for a green biorefinery - an option for Ireland?; Mind the gap: deciphering the gap between good intentions and healthy eating behaviour; Halting biodiversity loss by 2020 - implications for agriculture; A milk processing sector model for Ireland
Improving Electronic Health Record Note Comprehension With NoteAid: Randomized Trial of Electronic Health Record Note Comprehension Interventions With Crowdsourced Workers
BACKGROUND: Patient portals are becoming more common, and with them, the ability of patients to access their personal electronic health records (EHRs). EHRs, in particular the free-text EHR notes, often contain medical jargon and terms that are difficult for laypersons to understand. There are many Web-based resources for learning more about particular diseases or conditions, including systems that directly link to lay definitions or educational materials for medical concepts.
OBJECTIVE: Our goal is to determine whether use of one such tool, NoteAid, leads to higher EHR note comprehension ability. We use a new EHR note comprehension assessment tool instead of patient self-reported scores.
METHODS: In this work, we compare a passive, self-service educational resource (MedlinePlus) with an active resource (NoteAid) where definitions are provided to the user for medical concepts that the system identifies. We use Amazon Mechanical Turk (AMT) to recruit individuals to complete ComprehENotes, a new test of EHR note comprehension.
RESULTS: Mean scores for individuals with access to NoteAid are significantly higher than the mean baseline scores, both for raw scores (P=.008) and estimated ability (P=.02).
CONCLUSIONS: In our experiments, we show that the active intervention leads to significantly higher scores on the comprehension test as compared with a baseline group with no resources provided. In contrast, there is no significant difference between the group that was provided with the passive intervention and the baseline group. Finally, we analyze the demographics of the individuals who participated in our AMT task and show differences between groups that align with the current understanding of health literacy between populations. This is the first work to show improvements in comprehension using tools such as NoteAid as measured by an EHR note comprehension assessment tool as opposed to patient self-reported scores.
Sustainability in the biopharmaceutical industry: seeking a holistic perspective
Biopharmaceuticals manufacturing is a critical component of the modern healthcare system, with emerging new treatments composed of increasingly complex biomolecules offering solutions to chronic and debilitating disorders. While this sector continues to grow, it strongly exhibits "boom-to-bust" performance, which threatens its long-term viability. Future trends within the industry indicate a shift towards continuous production systems using single-use technologies that raises sustainability issues, yet research in this area is sparse and lacks consideration of the complex interactions between environmental, social and economic concerns. The authors outline a sustainability-focused vision and propose opportunities for research to aid the development of a more integrated approach that would enhance the sustainability of the industry.
Bias A-head? Analyzing Bias in Transformer-Based Language Model Attention Heads
Transformer-based pretrained large language models (PLMs) such as BERT and GPT
have achieved remarkable success in NLP tasks. However, PLMs are prone to
encoding stereotypical biases. Although a burgeoning literature has emerged on
stereotypical bias mitigation in PLMs, such as work on debiasing gender and
racial stereotyping, how such biases manifest and behave internally within PLMs
remains largely unknown. Understanding the internal stereotyping mechanisms may
allow better assessment of model fairness and guide the development of
effective mitigation strategies. In this work, we focus on attention heads, a
major component of the Transformer architecture, and propose a bias analysis
framework to explore and identify a small set of biased heads that are found to
contribute to a PLM's stereotypical bias. We conduct extensive experiments to
validate the existence of these biased heads and to better understand how they
behave. We investigate gender and racial bias in the English language in two
types of Transformer-based PLMs: the encoder-based BERT model and the
decoder-based autoregressive GPT model. Overall, the results shed light on
understanding the bias behavior in pretrained language models.
ComprehENotes, an Instrument to Assess Patient Reading Comprehension of Electronic Health Record Notes: Development and Validation
BACKGROUND: Patient portals are widely adopted in the United States and allow millions of patients access to their electronic health records (EHRs), including their EHR clinical notes. A patient's ability to understand the information in the EHR is dependent on their overall health literacy. Although many tests of health literacy exist, none specifically focuses on EHR note comprehension.
OBJECTIVE: The aim of this paper was to develop an instrument to assess patients' EHR note comprehension.
METHODS: We identified 6 common diseases or conditions (heart failure, diabetes, cancer, hypertension, chronic obstructive pulmonary disease, and liver failure) and selected 5 representative EHR notes for each disease or condition. One note that did not contain natural language text was removed. Questions were generated from these notes using Sentence Verification Technique and were analyzed using item response theory (IRT) to identify a set of questions that represent a good test of ability for EHR note comprehension.
RESULTS: Using Sentence Verification Technique, 154 questions were generated from the 29 EHR notes initially obtained. Of these, 83 were manually selected for inclusion in the Amazon Mechanical Turk crowdsourcing tasks and 55 were ultimately retained following IRT analysis. A follow-up validation with a second Amazon Mechanical Turk task and IRT analysis confirmed that the 55 questions test a latent ability dimension for EHR note comprehension. A short test of 14 items was created along with the 55-item test.
CONCLUSIONS: We developed ComprehENotes, an instrument for assessing EHR note comprehension from existing EHR notes, gathered responses using crowdsourcing, and used IRT to analyze those responses, thus resulting in a set of questions to measure EHR note comprehension. Crowdsourced responses from Amazon Mechanical Turk can be used to estimate item parameters and select a subset of items for inclusion in the test set using IRT. The final set of questions is the first test of EHR note comprehension.