Clinical text data in machine learning: Systematic review
Background: Clinical narratives represent the main form of communication within healthcare, providing a personalized account of patient history and assessments and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data.
Objective: The main aim of this study is to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigate the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice.
Methods: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multi-faceted interface, to perform a literature search against MEDLINE. We identified a total of 110 relevant studies and extracted information about the text data used to support machine learning, the NLP tasks supported, and their clinical applications. The data properties considered included size, provenance, collection methods, annotation, and any relevant statistics.
Results: The vast majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were used for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored as a strategy for minimizing annotation effort while maximizing the predictive performance of the model by iteratively sampling a subset of data for manual annotation.
Supervised learning was successfully used where clinical codes, integrated with free-text notes in electronic health records, were utilized as class labels. Similarly, distant supervision was used to exploit an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable due to the sensitive nature of the data considered. Besides their small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The vast majority of studies focused on the task of text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance.
Conclusions: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as ways of reducing annotation effort. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or from unsupervised learning, which does not require data annotation.
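The pool-based active learning loop this review describes (train, query the least certain unlabeled example, annotate it, retrain) can be sketched in a few lines. The nearest-centroid classifier, toy 2-D data, and scripted oracle below are illustrative assumptions for the sketch, not components of any reviewed study:

```python
# Toy pool-based active learning with uncertainty sampling: repeatedly
# pick the unlabeled example the current model is least sure about,
# "annotate" it via an oracle, and retrain on the enlarged labeled set.
import math
import random

def train_centroids(labeled):
    """Fit a nearest-centroid classifier: one mean point per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums.setdefault(y, [0.0, 0.0])
        sums[y][0] += x[0]; sums[y][1] += x[1]
        counts[y] = counts.get(y, 0) + 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}

def margin(centroids, x):
    """Gap between the two closest class centroids (small = uncertain)."""
    d = sorted(math.dist(x, c) for c in centroids.values())
    return d[1] - d[0]

def active_learning(pool, oracle, seed_labeled, budget):
    labeled, unlabeled = list(seed_labeled), list(pool)
    for _ in range(budget):
        model = train_centroids(labeled)
        # Query the pool example with the smallest decision margin.
        x = min(unlabeled, key=lambda p: margin(model, p))
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))   # the manual annotation step
    return train_centroids(labeled)

random.seed(0)
# Two noisy clusters; the oracle labels by the true generating cluster.
pool = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)] + \
       [(random.gauss(4, 1), random.gauss(4, 1)) for _ in range(50)]
oracle = lambda p: 0 if p[0] + p[1] < 4 else 1
seed = [((0.0, 0.0), 0), ((4.0, 4.0), 1)]
model = active_learning(pool, oracle, seed, budget=10)
```

With a budget of 10 queries the model sees only a fraction of the pool, mirroring the review's point that annotation effort, not raw data volume, is the limiting resource.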
Knowledge-based Biomedical Data Science 2019
Knowledge-based biomedical data science (KBDS) involves the design and
implementation of computer systems that act as if they knew about biomedicine.
Such systems depend on formally represented knowledge in computer systems,
often in the form of knowledge graphs. Here we survey the progress in the last
year in systems that use formally represented knowledge to address data science
problems in both clinical and biological domains, as well as on approaches for
creating knowledge graphs. Major themes include the relationships between
knowledge graphs and machine learning, the use of natural language processing,
and the expansion of knowledge-based approaches to novel domains, such as
Traditional Chinese Medicine and biodiversity.
Deep Learning in Cardiology
The medical field is creating large amounts of data that physicians are unable
to decipher and use efficiently. Moreover, rule-based expert systems are
inefficient at solving complicated medical tasks or at creating insights from
big data. Deep learning has emerged as a more accurate and effective technology
in a wide range of medical problems such as diagnosis, prediction and
intervention. Deep learning is a representation learning method that consists
of layers that transform the data non-linearly, thus, revealing hierarchical
relationships and structures. In this review we survey deep learning
application papers that use structured data, signal and imaging modalities from
cardiology. We discuss the advantages and limitations of applying deep learning
in cardiology that also apply in medicine in general, while proposing certain
directions as the most viable for clinical use.
SemClinBr -- a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks
The high volume of research focusing on extracting patient's information from
electronic health records (EHR) has led to an increase in the demand for
annotated corpora, which are a very valuable resource for both the development
and evaluation of natural language processing (NLP) algorithms. The absence of
a multi-purpose clinical corpus outside the scope of the English language,
especially in Brazilian Portuguese, is glaring and severely impacts scientific
progress in the biomedical NLP field. In this study, we developed a
semantically annotated corpus using clinical texts from multiple medical
specialties, document types, and institutions. We present the following: (1) a
survey listing common aspects and lessons learned from previous research, (2) a
fine-grained annotation schema which could be replicated and guide other
annotation initiatives, (3) a web-based annotation tool focusing on an
annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation
of the annotations. The result of this work is the SemClinBr, a corpus that has
1,000 clinical notes, labeled with 65,117 entities and 11,263 relations, and
can support a variety of clinical NLP tasks and boost the EHR's secondary use
for the Portuguese language.
Learning Latent Characteristics of Data and Models using Item Response Theory
A supervised machine learning model is trained with a large set of labeled training data, and evaluated on a smaller but still large set of test data. Especially with deep neural networks (DNNs), the complexity of the model requires that an extremely large data set is collected to prevent overfitting. It is often the case that these models do not take into account specific attributes of the training set examples, but instead treat each equally in the process of model training. This is due to the fact that it is difficult to model latent traits of individual examples at the scale of hundreds of thousands or millions of data points. However, there exist a set of psychometric methods that can model attributes of specific examples and can greatly improve model training and evaluation in the supervised learning process.
Item Response Theory (IRT) is a well-studied psychometric methodology for scale construction and evaluation. IRT jointly models human ability and example characteristics such as difficulty based on human response data. We introduce new evaluation metrics for both humans and machine learning models built using IRT, and propose new methods for applying IRT to machine learning-scale data.
We use IRT to make contributions to the machine learning community in the following areas: (i) new test sets for evaluating machine learning models with respect to a human population, (ii) new insights about how deep-learning models learn by tracking example difficulty and training conditions, (iii) new methods for data selection and curriculum building to improve model training efficiency, and (iv) a new test of electronic health literacy built with questions extracted from de-identified patient Electronic Health Records (EHRs).
We first introduce two new evaluation sets built and validated using IRT. These tests are the first IRT test sets to be applied to natural language processing tasks. Using IRT test sets allows for more comprehensive comparison of NLP models. Second, by modeling the difficulty of test set examples, we identify patterns that emerge when training deep neural network models that are consistent with human learning patterns. Specifically, as models are trained with larger training sets, they learn easy test set examples more quickly than hard examples. Third, we present a method for using soft labels on a subset of training data to improve deep learning model generalization. We show that fine-tuning a trained deep neural network with as little as 0.1% of the training data can improve model generalization in terms of test set accuracy. Fourth, we propose a new method for estimating IRT example and model parameters that allows for learning parameters at a much larger scale than previously available to accommodate the large data sets required for deep learning. This allows for learning IRT models at machine learning scale, with hundreds of thousands of examples and large ensembles of machine learning models. The response patterns of machine learning models can be used to learn IRT example characteristics instead of human response patterns. Fifth, we introduce a dynamic curriculum learning process that estimates model competency during training to adaptively select training data that is appropriate for learning at the given epoch. Finally, we introduce the ComprehENotes test, the first test of EHR comprehension for humans. The test is an accurate measure for identifying individuals with low EHR note comprehension ability, and validates the effectiveness of previously self-reported patient comprehension evaluations.
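The core of the IRT machinery described above fits in a few lines. The sketch below uses the standard two-parameter logistic (2PL) item characteristic curve; the item parameters and the coarse grid-search ability estimator are illustrative assumptions, not the scalable estimation method this work proposes:

```python
# Minimal 2PL IRT model: the probability that a respondent with ability
# `theta` answers an item correctly depends on the item's difficulty `b`
# and discrimination `a`.
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve: P(correct | theta, a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(responses, items, grid=None):
    """Maximum-likelihood ability estimate over a coarse grid.
    responses: list of 0/1 outcomes; items: list of (a, b) pairs."""
    grid = grid or [g / 10 for g in range(-40, 41)]
    def log_lik(theta):
        ll = 0.0
        for r, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            ll += math.log(p if r else 1.0 - p)
        return ll
    return max(grid, key=log_lik)

# An "easy" item (b = -1) vs. a "hard" item (b = +2) for an average
# respondent (theta = 0): the easy item has the higher success probability.
easy = p_correct(0.0, a=1.0, b=-1.0)
hard = p_correct(0.0, a=1.0, b=2.0)
```

In the machine-learning setting described above, the "respondents" are trained models and the response matrix is their per-example correctness, so the same likelihood yields example difficulties instead of human abilities.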
Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review
Novel approaches that complement and go beyond evidence-based medicine are required in the domain of chronic diseases, given the growing incidence of such conditions in the worldwide population. A promising avenue is the secondary use of electronic health records (EHRs), where patient data are analyzed to conduct clinical and translational research. Methods based on machine learning to process EHRs are resulting in improved understanding of patient clinical trajectories and chronic disease risk prediction, creating a unique opportunity to derive previously unknown clinical insights. However, a wealth of clinical histories remains locked behind clinical narratives in free-form text. Consequently, unlocking the full potential of EHR data is contingent on the development of natural language processing (NLP) methods to automatically transform clinical text into structured clinical data that can guide clinical decisions and potentially delay or prevent disease onset.