459 research outputs found

Extracting information from the text of electronic medical records to improve case detection: a systematic review

Author: Afzal
Afzal
Ananthakrishnan
Baus
Cano
Carroll
Carroll
Carroll
Castro
Chapman
Chen
Chung
Currie
de Lusignan
DeLisle
DeLisle
Donia Scott
Dorr
Elizabeth Ford
Ford
Friedlin
Friedman
Friedman
Graiser
Greenhalgh
Gulliford
Gundlapalli
Hanauer
Hanauer
Hanauer
Harkema
Helen E Smith
Imfeld
Jackie A Cassell
John A Carroll
Jones
Kalra
Karnik
Koeling
Kushida
Li
Liao
Lin
Lindberg
Love
Lovis
Ludvigsson
Manning
Manuel
McPeek Hinz
Mehrabi
Meystre
Nielen
Pakhomov
Pakhomov
Powsner
Rait
Resnik
Roch
Ryan
Savova
Soler
Stein
Stone
Tange
Tate
Tsui
Uzuner
Valkhoff
Walsh
Widdifield
Wilke
Wu
Xia
Xu
Xu
Yadav
Ye
Zeng
Zeng
Zheng
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2016

Background: Electronic medical records (EMRs) are revolutionizing health-related research. One key issue for study quality is the accurate identification of patients with the condition of interest. Information in EMRs can be entered as structured codes or unstructured free text. The majority of research studies have used only coded parts of EMRs for case-detection, which may bias findings, miss cases, and reduce study quality. This review examines whether incorporating information from text into case-detection algorithms can improve research quality. Methods: A systematic search returned 9659 papers, 67 of which reported on the extraction of information from free text of EMRs with the stated purpose of detecting cases of a named clinical condition. Methods for extracting information from text and the technical accuracy of case-detection algorithms were reviewed. Results: Studies mainly used US hospital-based EMRs, and extracted information from text for 41 conditions using keyword searches, rule-based algorithms, and machine learning methods. There was no clear difference in case-detection algorithm accuracy between rule-based and machine learning methods of extraction. Inclusion of information from text resulted in a significant improvement in algorithm sensitivity and area under the receiver operating characteristic in comparison to codes alone (median sensitivity 78% (codes + text) vs 62% (codes), P = .03; median area under the receiver operating characteristic 95% (codes + text) vs 88% (codes), P = .025). Conclusions: Text in EMRs is accessible, especially with open source information extraction algorithms, and significantly improves case detection when combined with codes. More harmonization of reporting within EMR studies is needed, particularly standardized reporting of algorithm accuracy metrics like positive predictive value (precision) and sensitivity (recall)
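The accuracy metrics this abstract asks studies to report can be computed from a simple confusion table. The sketch below is illustrative only: the function name `case_detection_metrics` and all counts are hypothetical, chosen so that the sensitivities mirror the reported medians (62% for codes, 78% for codes + text); they are not data from any reviewed study.

```python
def case_detection_metrics(tp, fp, fn, tn):
    """Return sensitivity (recall), PPV (precision), and specificity.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives against a reference standard.
    """
    def ratio(num, den):
        # Guard against empty denominators (e.g. no algorithm-positive subjects).
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical counts for 100 true cases and 900 non-cases (assumed, not from the review).
codes_only = case_detection_metrics(tp=62, fp=10, fn=38, tn=890)
codes_plus_text = case_detection_metrics(tp=78, fp=12, fn=22, tn=888)

for name, m in [("codes", codes_only), ("codes + text", codes_plus_text)]:
    print(f"{name}: sensitivity={m['sensitivity']:.2f}, ppv={m['ppv']:.2f}")
```

Reporting both sensitivity and PPV matters because adding text can raise sensitivity while also admitting a few more false positives, as in the hypothetical counts above.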

Crossref

PubMed Central

Sussex Research Online

Modeling Disease Severity in Multiple Sclerosis Using Electronic Health Records

Author: Ananthakrishnan Ashwin N.
Bove Riley M.
Cagan Andrew
Cai Tianxi
Chen Pei
Cheng Suchun
Chibnik Lori B.
Chitnis Tanuja
Churchill Susanne
De Jager Philip L.
Gainer Vivian
Karlson Elizabeth W.
Kohane Isaac
Liao Katherine P.
Murphy Shawn N.
Plenge Robert M.
Savova Guergana K.
Secor Elizabeth
Shaw Stanley Y.
Szolovits Peter
Weiner Howard L.
Xia Zongqi
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

Objective: To optimally leverage the scalability and unique features of the electronic health records (EHR) for research that would ultimately improve patient care, we need to accurately identify patients and extract clinically meaningful measures. Using multiple sclerosis (MS) as a proof of principle, we showcased how to leverage routinely collected EHR data to identify patients with a complex neurological disorder and derive an important surrogate measure of disease severity heretofore only available in research settings. Methods: In a cross-sectional observational study, 5,495 MS patients were identified from the EHR systems of two major referral hospitals using an algorithm that includes codified and narrative information extracted using natural language processing. In the subset of patients who receive neurological care at a MS Center where disease measures have been collected, we used routinely collected EHR data to extract two aggregate indicators of MS severity of clinical relevance: multiple sclerosis severity score (MSSS) and brain parenchymal fraction (BPF, a measure of whole brain volume). Results: The EHR algorithm that identifies MS patients has an area under the curve of 0.958, 83% sensitivity, 92% positive predictive value, and 89% negative predictive value when a 95% specificity threshold is used. The correlation between EHR-derived and true MSSS has a mean R² = 0.38±0.05, and that between EHR-derived and true BPF has a mean R² = 0.22±0.08. To illustrate its clinical relevance, derived MSSS captures the expected difference in disease severity between relapsing-remitting and progressive MS patients after adjusting for sex, age of symptom onset and disease duration (p = 1.56×10⁻¹²).
Conclusion: Incorporation of sophisticated codified and narrative EHR data accurately identifies MS patients and provides estimation of a well-accepted indicator of MS severity that is widely used in research settings but not part of the routine medical records. Similar approaches could be applied to other complex neurological disorders. National Institute of General Medical Sciences (U.S.) (NIH U54-LM008748)
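The abstract reports sensitivity "when a 95% specificity threshold is used". A minimal sketch of that idea follows, assuming each patient has a numeric algorithm score and a chart-review label. The function names and the toy scores are hypothetical and are not taken from the paper, which does not describe its implementation here.

```python
def threshold_at_specificity(scores, labels, target_specificity=0.95):
    """Return the lowest cutoff whose specificity meets the target.

    A subject is classified as a case when score >= cutoff;
    labels are 1 for confirmed cases and 0 for non-cases.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one non-case to estimate specificity")
    # Try observed scores in ascending order; infinity classifies nobody as a case.
    for cutoff in sorted(set(scores)) + [float("inf")]:
        true_neg = sum(1 for s in negatives if s < cutoff)
        if true_neg / len(negatives) >= target_specificity:
            return cutoff


def sensitivity_at(scores, labels, cutoff):
    """Fraction of confirmed cases scoring at or above the cutoff."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(1 for s in positives if s >= cutoff) / len(positives)


# Toy data (assumed): four non-cases followed by four confirmed cases.
scores = [0.1, 0.2, 0.3, 0.4, 0.9, 0.8, 0.35, 0.7]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

cutoff = threshold_at_specificity(scores, labels, target_specificity=1.0)
print(cutoff, sensitivity_at(scores, labels, cutoff))
```

Fixing specificity first and then reading off sensitivity keeps false positives bounded, which suits cohort building where misclassified non-cases are costly.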

DSpace@MIT

Crossref

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Automatic Prediction of Rheumatoid Arthritis Disease Activity from the Electronic Medical Records

Author: Canhao Helena
Chen Pei Jun
Dligach Dmitriy
Karlson Elizabeth W.
Lin Chen
Miller Timothy A.
Perez Raul Natanael Guzman
Plenge Robert M.
Savova Guergana K.
Shadick Nancy A.
Shen Yuanyan
Weinblatt Michael E.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 16/08/2013

Objective: We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record. Materials and Methods: The Training Set consisted of 2792 clinical notes and associated lab values. Test Set 1 included 1749 clinical notes and associated lab values. Test Set 2 included 344 clinical notes for which there were no associated lab values. The Apache clinical Text Analysis and Knowledge Extraction System was used to analyze the text and transform it into informative features to be combined with relevant lab values. Results: Experiments over a range of machine learning algorithms and features were conducted. The best performing combination was linear kernel Support Vector Machines with Unified Medical Language System Concept Unique Identifier features with feature selection and lab values. The Area Under the Receiver Operating Characteristic Curve (AUC) is 0.831 (σ = 0.0317), statistically significant as compared to two baselines (AUC = 0.758, σ = 0.0291). Algorithms demonstrated superior performance on cases clinically defined as extreme categories of disease activity (Remission and High) compared to those defined as intermediate categories (Moderate and Low) and included laboratory data on inflammatory markers. Conclusion: Automatic Rheumatoid Arthritis disease activity discovery from Electronic Medical Record data is a learnable task approximating human performance. As a result, this approach might have several research applications, such as the identification of patients for genome-wide pharmacogenetic studies that require large sample sizes with precise definitions of disease activity and response to therapies.

Harvard University - DASH

FigShare

Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts

Author: Agniel Denis
Ananthakrishnan Ashwin N.
Cagan Andrew
Cai Tianxi
Chen Pei
Churchill Susanne
Gainer Vivian S.
Goryachev Sergey
Karlson Elizabeth W.
Kohane Isaac
Kumar Vishesh
Lee Jaeyoung
Liao Katherine P.
Murphy Shawn N.
Plenge Robert M.
Savova Guergana K.
Shaw Stanley Y.
Szolovits Peter
Xia Zongqi
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/09/2014

Background: Typically, algorithms to classify phenotypes using electronic medical record (EMR) data were developed to perform well in a specific patient population. There is increasing interest in analyses which can allow study of a specific outcome across different diseases. Such a study in the EMR would require an algorithm that can be applied across different patient populations. Our objectives were: (1) to develop an algorithm that would enable the study of coronary artery disease (CAD) across diverse patient populations; (2) to study the impact of adding narrative data extracted using natural language processing (NLP) in the algorithm. Additionally, we demonstrate how to implement the CAD algorithm to compare risk across 3 chronic diseases in a preliminary study. Methods and Results: We studied 3 established EMR-based patient cohorts: diabetes mellitus (DM, n = 65,099), inflammatory bowel disease (IBD, n = 10,974), and rheumatoid arthritis (RA, n = 4,453) from two large academic centers. We developed a CAD algorithm using NLP in addition to structured data (e.g. ICD9 codes) in the RA cohort and validated it in the DM and IBD cohorts. The CAD algorithm using NLP in addition to structured data achieved specificity >95% with a positive predictive value (PPV) of 90% in the training (RA) and validation sets (IBD and DM). The addition of NLP data improved the sensitivity for all cohorts, classifying an additional 17% of CAD subjects in IBD and 10% in DM while maintaining PPV of 90%. The algorithm classified 16,488 DM (26.1%), 457 IBD (4.2%), and 245 RA (5.0%) with CAD. In a cross-sectional analysis, CAD risk was 63% lower in RA and 68% lower in IBD compared to DM (p<0.0001) after adjusting for traditional cardiovascular risk factors. Conclusions: We developed and validated a CAD algorithm that performed well across diverse patient populations.
The addition of NLP into the CAD algorithm improved the sensitivity of the algorithm, particularly in cohorts where the prevalence of CAD was low. Preliminary data suggest that CAD risk was significantly lower in RA and IBD compared to DM. National Institutes of Health (U.S.). Informatics for Integrating Biology and the Bedside Project (U54LM008748)

DSpace@MIT

Crossref

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

FigShare

Extracting research-quality phenotypes from electronic health records to support precision medicine

Author
Publication venue: BioMed Central
Publication date: 30/04/2015

Springer - Publisher Connector

Electronic Medical Records for Discovery Research in Rheumatoid Arthritis

Author: Arnett
Banal
Bates
Berner
Bukhari
DesRoches
Effler
Forslind
Gabriel
Greenberg
Hastie
Jha
Jha
Katz
Klompas
Lazarus
Lee
Levin
Losina
Meystre
Meystre
Murphy
Penz
Poon
Schneeweiss
Singh
Solomon
Solti
Trivedi
Turchin
Zeng
Zou
Zou
Publication venue: 'Wiley'
Publication date: 01/03/2010

Objective: Electronic medical records (EMRs) are a rich data source for discovery research but are underutilized due to the difficulty of extracting highly accurate clinical data. We assessed whether a classification algorithm incorporating narrative EMR data (typed physician notes) more accurately classifies subjects with rheumatoid arthritis (RA) compared with an algorithm using codified EMR data alone. Methods: Subjects with ≥1 International Classification of Diseases, Ninth Revision RA code (714.xx) or who had anti–cyclic citrullinated peptide (anti-CCP) checked in the EMR of 2 large academic centers were included in an “RA Mart” (n = 29,432). For all 29,432 subjects, we extracted narrative (using natural language processing) and codified RA clinical information. In a training set of 96 RA and 404 non-RA cases from the RA Mart classified by medical record review, we used narrative and codified data to develop classification algorithms using logistic regression. These algorithms were applied to the entire RA Mart. We calculated and compared the positive predictive value (PPV) of these algorithms by reviewing the records of an additional 400 subjects classified as having RA by the algorithms. Results: A complete algorithm (narrative and codified data) classified RA subjects with a significantly higher PPV of 94% than an algorithm with codified data alone (PPV of 88%). Characteristics of the RA cohort identified by the complete algorithm were comparable to existing RA cohorts (80% women, 63% anti-CCP positive, and 59% positive for erosions). Conclusion: We demonstrate the ability to utilize complete EMR data to define an RA cohort with a PPV of 94%, which was superior to an algorithm using codified data alone. National Library of Medicine (U.S.) (Award U54LM008748); National Institutes of Health (U.S.). i2b2 (Informatics for Integrating Biology and the Bedside) (Grant U54-LM008748)

DSpace@MIT

Crossref

PubMed Central

The University of Manchester - Institutional Repository

Efficient Development of Electronic Health Record Based Algorithms to Identify Rheumatoid Arthritis

Author: Carroll Robert James
Publication venue: VANDERBILT
Publication date

Analyzing the heterogeneity of rule-based EHR phenotyping algorithms in CALIBER and the UK Biobank

Author: Denaxas S
Fitzpatrick N
Hemingway H
Parkinson H
Sudlow C
Publication venue: CEUR
Publication date: 21/08/2019

Electronic Health Records (EHR) are data generated during routine interactions across healthcare settings and contain rich, longitudinal information on diagnoses, symptoms, medications, investigations and tests. A primary use-case for EHR is the creation of phenotyping algorithms used to identify disease status, onset and progression or extraction of information on risk factors or biomarkers. Phenotyping however is challenging since EHR are collected for different purposes, have variable data quality and often require significant harmonization. While considerable effort goes into the phenotyping process, no consistent methodology for representing algorithms exists in the UK. Creating a national repository of curated algorithms can potentially enable algorithm dissemination and reuse by the wider community. A critical first step is the creation of a robust minimum information standard for phenotyping algorithm components (metadata, implementation logic, validation evidence) which involves identifying and reviewing the complexity and heterogeneity of current UK EHR algorithms. In this study, we analyzed all available EHR phenotyping algorithms (n=70) from two large-scale contemporary EHR resources in the UK (CALIBER and UK Biobank). We documented EHR sources, controlled clinical terminologies, evidence of algorithm validation, representation and implementation logic patterns. Understanding the heterogeneity of UK EHR algorithms and identifying common implementation patterns will facilitate the design of a minimum information standard for representing and curating algorithms nationally and internationally

UCL Discovery

Improving Case Definition of Crohnʼs Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing

Author: Ananthakrishnan Ashwin N.
Cai Tianxi
Chen Pei
Cheng Su-Chun
Churchill Susanne
Gainer Vivian
Karlson Elizabeth W.
Kohane Isaac
Liao Katherine P.
Murphy Shawn N.
Perez Raul Guzman
Plenge Robert M.
Savova Guergana
Shaw Stanley
Szolovits Peter
Xia Zongqi
Publication venue: 'Ovid Technologies (Wolters Kluwer Health)'
Publication date: 01/06/2013

Available in PMC 2014 June 01. Background: Previous studies identifying patients with inflammatory bowel disease using administrative codes have yielded inconsistent results. Our objective was to develop a robust electronic medical record–based model for classification of inflammatory bowel disease leveraging the combination of codified data and information from clinical text notes using natural language processing. Methods: Using the electronic medical records of 2 large academic centers, we created data marts for Crohn’s disease (CD) and ulcerative colitis (UC) comprising patients with ≥1 International Classification of Diseases, 9th edition, code for each disease. We used codified (i.e., International Classification of Diseases, 9th edition codes, electronic prescriptions) and narrative data from clinical notes to develop our classification model. Model development and validation were performed in a training set of 600 randomly selected patients for each disease with medical record review as the gold standard. Logistic regression with the adaptive LASSO penalty was used to select informative variables. Results: We confirmed 399 CD cases (67%) in the CD training set and 378 UC cases (63%) in the UC training set. For both, a combined model including narrative and codified data had better accuracy (area under the curve for CD 0.95; UC 0.94) than models using only disease International Classification of Diseases, 9th edition codes (area under the curve 0.89 for CD; 0.86 for UC). Addition of natural language processing narrative terms to our final model resulted in classification of 6% to 12% more subjects with the same accuracy. Conclusions: Inclusion of narrative concepts identified using natural language processing improves the accuracy of electronic medical records case definition for CD and UC while simultaneously identifying more subjects compared with models using codified data alone.
National Institutes of Health (U.S.) (NIH U54-LM008748); American Gastroenterological Association; National Institutes of Health (U.S.) (NIH K08 AR060257); Beth Israel Deaconess Medical Center (Katherine Swan Ginsburg Fund); National Institutes of Health (U.S.) (NIH R01-AR056768); Burroughs Wellcome Fund (Career Award for Medical Scientists); National Institutes of Health (U.S.) (NIH U01-GM092691); National Institutes of Health (U.S.) (NIH R01-AR059648)

DSpace@MIT

Crossref

PubMed Central