
    An active learning-enabled annotation system for clinical named entity recognition

    Abstract. Background: Active learning (AL) has shown promising potential to minimize annotation cost while maximizing performance in building statistical natural language processing (NLP) models. However, very few studies have investigated AL in a real-life setting in the medical domain. Methods: In this study, we developed the first AL-enabled annotation system for clinical named entity recognition (NER) with a novel AL algorithm. Besides a simulation study to evaluate the novel AL algorithm, we conducted user studies with two nurses using this system to assess the performance of AL in real-world annotation processes for building clinical NER models. Results: The simulation results show that the novel AL algorithm outperformed the traditional AL algorithm and random sampling. However, the user study tells a different story: AL methods did not always perform better than random sampling for different users. Conclusions: We found that the increased information content of actively selected sentences is strongly offset by the increased time required to annotate them. Moreover, annotation time was not considered by the querying algorithms. Our future work includes developing better AL algorithms that account for estimated annotation time and evaluating the system with a larger number of users.
    https://deepblue.lib.umich.edu/bitstream/2027.42/137676/1/12911_2017_Article_466.pd
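    The abstract contrasts a novel AL algorithm with a traditional AL baseline and random sampling. As a rough, hypothetical illustration of the traditional baseline only (the paper's novel algorithm is not described here), the sketch below shows pool-based least-confidence sampling in Python, with a scikit-learn linear classifier standing in for the clinical NER model; all names and the synthetic data are illustrative.

        # Hypothetical sketch of pool-based active learning with least-confidence
        # sampling; a linear classifier stands in for the clinical NER model.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def select_batch(model, pool_X, batch_size=10):
            """Pick the pool examples the current model is least confident about."""
            confidence = model.predict_proba(pool_X).max(axis=1)
            return np.argsort(confidence)[:batch_size]   # lowest confidence first

        def active_learning_loop(labeled_X, labeled_y, pool_X, pool_y, rounds=5):
            model = LogisticRegression(max_iter=1000)
            for _ in range(rounds):
                model.fit(labeled_X, labeled_y)
                picked = select_batch(model, pool_X)
                # Simulated annotation: reveal the gold labels of the queried items.
                labeled_X = np.vstack([labeled_X, pool_X[picked]])
                labeled_y = np.concatenate([labeled_y, pool_y[picked]])
                pool_X = np.delete(pool_X, picked, axis=0)
                pool_y = np.delete(pool_y, picked, axis=0)
            return model

        # Toy usage on synthetic data, standing in for an unlabeled sentence pool.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))
        y = (X[:, 0] > 0).astype(int)
        trained = active_learning_loop(X[:20], y[:20], X[20:], y[20:])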

    ์•ฝ๋ฌผ ๊ฐ์‹œ๋ฅผ ์œ„ํ•œ ๋น„์ •ํ˜• ํ…์ŠคํŠธ ๋‚ด ์ž„์ƒ ์ •๋ณด ์ถ”์ถœ ์—ฐ๊ตฌ

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์‘์šฉ๋ฐ”์ด์˜ค๊ณตํ•™๊ณผ, 2023. 2. ์ดํ˜•๊ธฐ.Pharmacovigilance is a scientific activity to detect, evaluate and understand the occurrence of adverse drug events or other problems related to drug safety. However, concerns have been raised over the quality of drug safety information for pharmacovigilance, and there is also a need to secure a new data source to acquire drug safety information. On the other hand, the rise of pre-trained language models based on a transformer architecture has accelerated the application of natural language processing (NLP) techniques in diverse domains. In this context, I tried to define two problems in pharmacovigilance as an NLP task and provide baseline models for the defined tasks: 1) extracting comprehensive drug safety information from adverse drug events narratives reported through a spontaneous reporting system (SRS) and 2) extracting drug-food interaction information from abstracts of biomedical articles. I developed annotation guidelines and performed manual annotation, demonstrating that strong NLP models can be trained to extracted clinical information from unstructrued free-texts by fine-tuning transformer-based language models on a high-quality annotated corpus. Finally, I discuss issues to consider when when developing annotation guidelines for extracting clinical information related to pharmacovigilance. The annotated corpora and the NLP models in this dissertation can streamline pharmacovigilance activities by enhancing the data quality of reported drug safety information and expanding the data sources.์•ฝ๋ฌผ ๊ฐ์‹œ๋Š” ์•ฝ๋ฌผ ๋ถ€์ž‘์šฉ ๋˜๋Š” ์•ฝ๋ฌผ ์•ˆ์ „์„ฑ๊ณผ ๊ด€๋ จ๋œ ๋ฌธ์ œ์˜ ๋ฐœ์ƒ์„ ๊ฐ์ง€, ํ‰๊ฐ€ ๋ฐ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ๊ณผํ•™์  ํ™œ๋™์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์•ฝ๋ฌผ ๊ฐ์‹œ์— ์‚ฌ์šฉ๋˜๋Š” ์˜์•ฝํ’ˆ ์•ˆ์ „์„ฑ ์ •๋ณด์˜ ๋ณด๊ณ  ํ’ˆ์งˆ์— ๋Œ€ํ•œ ์šฐ๋ ค๊ฐ€ ๊พธ์ค€ํžˆ ์ œ๊ธฐ๋˜์—ˆ์œผ๋ฉฐ, ํ•ด๋‹น ๋ณด๊ณ  ํ’ˆ์งˆ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด์„œ๋Š” ์•ˆ์ „์„ฑ ์ •๋ณด๋ฅผ ํ™•๋ณดํ•  ์ƒˆ๋กœ์šด ์ž๋ฃŒ์›์ด ํ•„์š”ํ•˜๋‹ค. ํ•œํŽธ ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์‚ฌ์ „ํ›ˆ๋ จ ์–ธ์–ด๋ชจ๋ธ์ด ๋“ฑ์žฅํ•˜๋ฉด์„œ ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ์—์„œ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๊ธฐ์ˆ  ์ ์šฉ์ด ๊ฐ€์†ํ™”๋˜์—ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋งฅ๋ฝ์—์„œ ๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์—์„œ๋Š” ์•ฝ๋ฌผ ๊ฐ์‹œ๋ฅผ ์œ„ํ•œ ๋‹ค์Œ 2๊ฐ€์ง€ ์ •๋ณด ์ถ”์ถœ ๋ฌธ์ œ๋ฅผ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ฌธ์ œ ํ˜•ํƒœ๋กœ ์ •์˜ํ•˜๊ณ  ๊ด€๋ จ ๊ธฐ์ค€ ๋ชจ๋ธ์„ ๊ฐœ๋ฐœํ•˜์˜€๋‹ค: 1) ์ˆ˜๋™์  ์•ฝ๋ฌผ ๊ฐ์‹œ ์ฒด๊ณ„์— ๋ณด๊ณ ๋œ ์ด์ƒ์‚ฌ๋ก€ ์„œ์ˆ ์ž๋ฃŒ์—์„œ ํฌ๊ด„์ ์ธ ์•ฝ๋ฌผ ์•ˆ์ „์„ฑ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•œ๋‹ค. 2) ์˜๋ฌธ ์˜์•ฝํ•™ ๋…ผ๋ฌธ ์ดˆ๋ก์—์„œ ์•ฝ๋ฌผ-์‹ํ’ˆ ์ƒํ˜ธ์ž‘์šฉ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์•ˆ์ „์„ฑ ์ •๋ณด ์ถ”์ถœ์„ ์œ„ํ•œ ์–ด๋…ธํ…Œ์ด์…˜ ๊ฐ€์ด๋“œ๋ผ์ธ์„ ๊ฐœ๋ฐœํ•˜๊ณ  ์ˆ˜์ž‘์—…์œผ๋กœ ์–ด๋…ธํ…Œ์ด์…˜์„ ์ˆ˜ํ–‰ํ•˜์˜€๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ๊ณ ํ’ˆ์งˆ์˜ ์ž์—ฐ์–ด ํ•™์Šต๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์‚ฌ์ „ํ•™์Šต ์–ธ์–ด๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ•จ์œผ๋กœ์จ ๋น„์ •ํ˜• ํ…์ŠคํŠธ์—์„œ ์ž„์ƒ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ฐ•๋ ฅํ•œ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ชจ๋ธ ๊ฐœ๋ฐœ์ด ๊ฐ€๋Šฅํ•จ์„ ํ™•์ธํ•˜์˜€๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์—์„œ๋Š” ์•ฝ๋ฌผ๊ฐ์‹œ์™€ ๊ด€๋ จ๋œ์ž„์ƒ ์ •๋ณด ์ถ”์ถœ์„ ์œ„ํ•œ ์–ด๋…ธํ…Œ์ด์…˜ ๊ฐ€์ด๋“œ๋ผ์ธ์„ ๊ฐœ๋ฐœํ•  ๋•Œ ๊ณ ๋ คํ•ด์•ผ ํ•  ์ฃผ์˜ ์‚ฌํ•ญ์— ๋Œ€ํ•ด ๋…ผ์˜ํ•˜์˜€๋‹ค. 
๋ณธ ํ•™์œ„ ๋…ผ๋ฌธ์—์„œ ์†Œ๊ฐœํ•œ ์ž์—ฐ์–ด ํ•™์Šต๋ฐ์ดํ„ฐ์™€ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ชจ๋ธ์€ ์•ฝ๋ฌผ ์•ˆ์ „์„ฑ ์ •๋ณด์˜ ๋ณด๊ณ  ํ’ˆ์งˆ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ณ  ์ž๋ฃŒ์›์„ ํ™•์žฅํ•˜์—ฌ ์•ฝ๋ฌผ ๊ฐ์‹œ ํ™œ๋™์„ ๋ณด์กฐํ•  ๊ฒƒ์œผ๋กœ ๊ธฐ๋Œ€๋œ๋‹ค.Chapter 1 1 1.1 Contributions of this dissertation 2 1.2 Overview of this dissertation 2 1.3 Other works 3 Chapter 2 4 2.1 Pharmacovigilance 4 2.2 Biomedical NLP for pharmacovigilance 6 2.2.1 Pre-trained language models 6 2.2.2 Corpora to extract clinical information for pharmacovigilance 9 Chapter 3 11 3.1 Motivation 12 3.2 Proposed Methods 14 3.2.1 Data source and text corpus 15 3.2.2 Annotation of ADE narratives 16 3.2.3 Quality control of annotation 17 3.2.4 Pretraining KAERS-BERT 18 3.2.6 Named entity recognition 20 3.2.7 Entity label classification and sentence extraction 21 3.2.8 Relation extraction 21 3.2.9 Model evaluation 22 3.2.10 Ablation experiment 23 3.3 Results 24 3.3.1 Annotated ICSRs 24 3.3.2 Corpus statistics 26 3.3.3 Performance of NLP models to extract drug safety information 28 3.3.4 Ablation experiment 31 3.4 Discussion 33 3.5 Conclusion 38 Chapter 4 39 4.1 Motivation 39 4.2 Proposed Methods 43 4.2.1 Data source 44 4.2.2 Annotation 45 4.2.3 Quality control of annotation 49 4.2.4 Baseline model development 49 4.3 Results 50 4.3.1 Corpus statistics 50 4.3.2 Annotation Quality 54 4.3.3 Performance of baseline models 55 4.3.4 Qualitative error analysis 56 4.4 Discussion 59 4.5 Conclusion 63 Chapter 5 64 5.1 Issues around defining a word entity 64 5.2 Issues around defining a relation between word entities 66 5.3 Issues around defining entity labels 68 5.4 Issues around selecting and preprocessing annotated documents 68 Chapter 6 71 6.1 Dissertation summary 71 6.2 Limitation and future works 72 6.2.1 Development of end-to-end information extraction models from free-texts to database based on existing structured information 72 6.2.2 Application of in-context learning framework in clinical information extraction 74 Chapter 7 76 7.1 Annotation Guideline for "Extraction of Comprehensive Drug Safety Information from Adverse Event Narratives Reported through Spontaneous Reporting System" 76 7.2 Annotation Guideline for "Extraction of Drug-Food Interactions from the Abtracts of Biomedical Articles" 100๋ฐ•

    Clinical text data in machine learning: Systematic review

    Background: Clinical narratives represent the main form of communication within healthcare, providing a personalized account of patient history and assessments and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. Objective: The main aim of this study is to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigate the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. Methods: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multi-faceted interface, to perform a literature search against MEDLINE. We identified a total of 110 relevant studies and extracted information about the text data used to support machine learning, the NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics. Results: The vast majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model. Supervised learning was successfully used where clinical codes, integrated with free-text notes into electronic health records, were utilized as class labels. Similarly, distant supervision was used to leverage an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable due to the sensitive nature of the data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The vast majority of studies focused on the task of text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance. Conclusions: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as ways of saving annotation effort. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which does not require data annotation.
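    The review highlights distant supervision, in which an existing knowledge base is used to annotate raw text automatically so that a conventional supervised model can be trained without manual labels. The toy sketch below illustrates the idea; the term dictionary, notes, and use of scikit-learn are invented for the example and not drawn from any reviewed study.

        # Toy distant supervision: a small knowledge base assigns noisy labels,
        # which then train an ordinary supervised text classifier.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        knowledge_base = {"metformin": "diabetes", "insulin": "diabetes",
                          "salbutamol": "asthma", "inhaler": "asthma"}

        notes = ["started metformin for glycaemic control",
                 "insulin dose adjusted overnight",
                 "salbutamol inhaler prescribed for wheeze",
                 "uses inhaler twice daily"]

        # Distant labels: a note inherits the label of any KB term it mentions.
        distant_labels = [next(lab for term, lab in knowledge_base.items() if term in note)
                          for note in notes]

        clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
        clf.fit(notes, distant_labels)          # trained on auto-derived labels only
        print(clf.predict(["metformin continued at current dose"]))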

    LeafAI: query generator for clinical cohort discovery rivaling a human programmer

    Objective: Identifying study-eligible patients within clinical databases is a critical step in clinical research. However, accurate query design typically requires extensive technical and biomedical expertise. We sought to create a system capable of generating data model-agnostic queries while also providing novel logical reasoning capabilities for complex clinical trial eligibility criteria. Materials and Methods: The task of query creation from eligibility criteria requires solving several text-processing problems, including named entity recognition and relation extraction, sequence-to-sequence transformation, normalization, and reasoning. We incorporated hybrid deep learning and rule-based modules for these, as well as a knowledge base built from the Unified Medical Language System (UMLS) and linked ontologies. To enable data model-agnostic query creation, we introduce a novel method for tagging database schema elements using UMLS concepts. To evaluate our system, called LeafAI, we compared its capability with that of a human database programmer in identifying patients who had been enrolled in 8 clinical trials conducted at our institution. We measured performance by the number of actual enrolled patients matched by the generated queries. Results: LeafAI matched a mean of 43% of enrolled patients, with 27,225 patients identified as eligible across the 8 clinical trials, compared with 27% matched and 14,587 identified as eligible by the queries of a human database programmer. The human programmer spent 26 total hours crafting queries, compared to several minutes for LeafAI. Conclusions: Our work contributes a state-of-the-art data model-agnostic query generation system capable of conditional reasoning using a knowledge base. We demonstrate that LeafAI can rival a human programmer in finding patients eligible for clinical trials.
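    The abstract's key idea for data model-agnostic query creation is tagging database schema elements with UMLS concepts, so that eligibility criteria normalized to UMLS can be resolved to columns of whatever data model is in use. The fragment below is a hypothetical illustration of that lookup step, not LeafAI's implementation; the schema names, concept identifiers, and SQL are invented for the example.

        # Hypothetical schema tagging: columns of the active data model are mapped
        # to UMLS concepts so a normalized criterion can be turned into SQL.
        schema_tags = {
            # data-model column                          -> UMLS CUI (illustrative)
            "condition_occurrence.condition_concept_id": "C0011849",  # diabetes mellitus
            "drug_exposure.drug_concept_id":             "C0013227",  # pharmaceutical preparation
        }

        def column_for_concept(cui):
            """Return the column in the active data model tagged with the given CUI."""
            for column, tagged_cui in schema_tags.items():
                if tagged_cui == cui:
                    return column
            raise KeyError(f"no column tagged with {cui}")

        # Criterion extracted from eligibility text: "history of diabetes mellitus".
        criterion_cui = "C0011849"
        column = column_for_concept(criterion_cui)
        table = column.split(".")[0]
        query = f"SELECT DISTINCT person_id FROM {table} WHERE {column} = :concept"
        print(query)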

    The devices, experimental scaffolds, and biomaterials ontology (DEB): a tool for mapping, annotation, and analysis of biomaterials' data

    The size and complexity of the biomaterials literature make systematic data analysis an excruciating manual task. A practical solution is creating databases and information resources. Implant design and biomaterials research can greatly benefit from an open database for systematic data retrieval. Ontologies are pivotal to knowledge base creation, serving to represent and organize domain knowledge. To name but two examples, GO, the Gene Ontology, and ChEBI, the Chemical Entities of Biological Interest ontology, together with their associated databases, are central resources to their respective research communities. The creation of the devices, experimental scaffolds, and biomaterials ontology (DEB), an open resource for organizing information about biomaterials, their design, manufacture, and biological testing, is described. It is developed using text analysis for identifying ontology terms from a biomaterials gold standard corpus, systematically curated to represent the domain's lexicon. Topics covered are validated by members of the biomaterials research community. The ontology may be used for searching terms, performing annotations for machine learning applications, standardized meta-data indexing, and other cross-disciplinary data exploitation. The input of the biomaterials community to this effort to create data-driven open-access research tools is encouraged and welcomed.
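    One stated use of DEB is annotating text with ontology terms for machine learning applications and meta-data indexing. The snippet below is a toy dictionary-matching sketch of such annotation; the terms and class identifiers are illustrative placeholders, not actual DEB classes.

        # Toy ontology-based annotation: match ontology labels against free text
        # and emit character-offset span annotations. Terms are illustrative only.
        import re

        ontology_terms = {"hydrogel": "DEB:scaffold_material",
                          "titanium": "DEB:implant_material",
                          "electrospinning": "DEB:fabrication_method"}

        def annotate(text):
            spans = []
            for term, class_id in ontology_terms.items():
                for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
                    spans.append((m.start(), m.end(), term, class_id))
            return sorted(spans)

        print(annotate("A titanium implant coated with a hydrogel made by electrospinning."))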

    How will the Internet of Things enable Augmented Personalized Health?

    Internet-of-Things (IoT) is profoundly redefining the way we create, consume, and share information. Health aficionados and citizens are increasingly using IoT technologies to track their sleep, food intake, activity, vital body signals, and other physiological observations. This is complemented by IoT systems that continuously collect health-related data from the environment and inside the living quarters. Together, these have created an opportunity for a new generation of healthcare solutions. However, interpreting data to understand an individual's health is challenging. It is usually necessary to look at that individual's clinical record and behavioral information, as well as social and environmental information affecting that individual. Interpreting how well a patient is doing also requires looking at their adherence to respective health objectives, the application of relevant clinical knowledge, and the desired outcomes. We resort to the vision of Augmented Personalized Healthcare (APH) to exploit the extensive variety of relevant data and medical knowledge using Artificial Intelligence (AI) techniques to extend and enhance human health, and we present various stages of augmented health management strategies: self-monitoring, self-appraisal, self-management, intervention, and disease progress tracking and prediction. kHealth technology, a specific incarnation of APH, and its application to asthma and other diseases are used to provide illustrations and discuss alternatives for technology-assisted health management. Several prominent efforts involving IoT and patient-generated health data (PGHD) with respect to converting multimodal data into actionable information (big data to smart data) are also identified. The roles of three components in an evidence-based semantic perception approach (contextualization, abstraction, and personalization) are discussed.
    • โ€ฆ
    corecore