20 research outputs found

    Natural language processing (NLP) for clinical information extraction and healthcare research

    Introduction: Epilepsy is a common disease with multiple comorbidities. Routinely collected health care data have been successfully used in epilepsy research, but they lack the level of detail needed for in-depth study of complex interactions between the aetiology, comorbidities, and treatment that affect patient outcomes. The aim of this work is to use natural language processing (NLP) technology to create detailed disease-specific datasets derived from the free text of clinic letters in order to enrich the information that is already available. Method: An NLP pipeline for the extraction of epilepsy clinical text (ExECT) was redeveloped to extract a wider range of variables. A gold standard annotation set for epilepsy clinic letters was created for the validation of the ExECT v2 output. A set of clinic letters from the Epi25 study was processed and the datasets produced were validated against Swansea Neurology Biobank records. A data linkage study investigating genetic influences on epilepsy outcomes using GP and hospital records was supplemented with the seizure frequency dataset produced by ExECT v2. Results: The validation of ExECT v2 produced overall precision, recall, and F1 score of 0.90, 0.86, and 0.88, respectively. A method of uploading, annotating, and linking genetic variant datasets within the SAIL databank was established. No significant differences in the genetic burden of rare and potentially damaging variants were observed between the individuals with vs without unscheduled admissions, and between individuals on monotherapy vs polytherapy. No significant difference was observed in the genetic burden between people who were seizure free for over a year and those who experienced at least one seizure a year. Conclusion: This work presents successful extraction of epilepsy clinical information and explores how this information can be used in epilepsy research. 
The approach taken in the development of ExECT v2, and the research linking the NLP outputs, routinely collected health care data, and genetics set the way for wider research
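The three validation figures quoted above are linked: F1 is the harmonic mean of precision and recall. A minimal sketch of that relationship, using the rounded values from the abstract (the function name `f1_score` is illustrative and not part of ExECT):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract above.
f1 = f1_score(0.90, 0.86)
print(round(f1, 2))  # 0.88
```

Because the inputs are already rounded, the result only approximately reproduces a figure computed from raw counts.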

    Incidence, Prevalence, and Health Care Outcomes in Idiopathic Intracranial Hypertension

    Objective: To characterise trends in incidence, prevalence, and healthcare outcomes in the idiopathic intracranial hypertension (IIH) population in Wales using routinely collected healthcare data. Methods: We used and validated primary and secondary care IIH diagnosis codes within the Secure Anonymised Information Linkage databank, to ascertain IIH cases and controls, in a retrospective cohort study between 2003 and 2017. We recorded body mass index (BMI), deprivation quintile, CSF diversion surgery and unscheduled hospital admissions in case and control cohorts. Results: We analysed 35 million patient years of data. There were 1765 cases of IIH in 2017 (85% female). The prevalence and incidence of IIH in 2017 were 76/100,000 and 7.8/100,000/year, a significant increase from 2003 (corresponding figures=12/100,000 and 2.3/100,000/year) (p<0.001). IIH prevalence is associated with increasing BMI and increasing deprivation. The odds ratio for developing IIH in the least deprived quintile compared to the most deprived quintile, adjusted for gender and BMI, was 0.65 (95% CI 0.55 to 0.76). 9% of IIH cases had CSF shunts with less than 0.2% having bariatric surgery. Unscheduled hospital admissions were higher in the IIH cohort compared to controls (rate ratio=5.28, p<0.001) and in individuals with IIH and CSF shunts compared to those without shunts (rate ratio=2.02, p<0.01). Conclusions: IIH incidence and prevalence are increasing considerably, corresponding to population increases in BMI, and are associated with increased deprivation. This has important implications for healthcare professionals and policy makers given the comorbidities, complications and increased healthcare utilization associated with II

    Markup: A Web-Based Annotation Tool Powered by Active Learning

    Across various domains, such as health and social care, law, news, and social media, there are increasing quantities of unstructured texts being produced. These potential data sources often contain rich information that could be used for domain-specific and research purposes. However, the unstructured nature of free-text data poses a significant challenge for its utilisation due to the necessity of substantial manual intervention from domain experts to label embedded information. Annotation tools can assist with this process by providing functionality that enables the accurate capture and transformation of unstructured texts into structured annotations, which can be used individually or as part of larger Natural Language Processing (NLP) pipelines. We present Markup (https://www.getmarkup.com/), an open-source, web-based annotation tool that is undergoing continued development for use across all domains. Markup incorporates NLP and Active Learning (AL) technologies to enable rapid and accurate annotation using custom user configurations, predictive annotation suggestions, and automated mapping suggestions to both domain-specific ontologies, such as the Unified Medical Language System (UMLS), and custom, user-defined ontologies. We demonstrate a real-world use case of how Markup has been used in a healthcare setting to annotate structured information from unstructured clinic letters, where captured annotations were used to build and test NLP applications
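The abstract does not say which Active Learning strategy Markup uses. As a generic illustration only, the sketch below shows least-confidence sampling, one common AL strategy in which the documents the model is least sure about are sent to the annotator first. The function name, document ids, and probabilities are all hypothetical.

```python
from typing import Dict, List


def least_confident(probabilities: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k document ids whose top predicted class probability is lowest."""
    # Confidence is the probability assigned to the most likely label.
    confidence = {doc_id: max(p) for doc_id, p in probabilities.items()}
    return sorted(confidence, key=lambda doc_id: confidence[doc_id])[:k]


# Hypothetical model outputs for three unlabelled documents.
pool = {
    "letter_a": [0.95, 0.05],
    "letter_b": [0.55, 0.45],
    "letter_c": [0.70, 0.30],
}
queue = least_confident(pool, k=2)
print(queue)  # ['letter_b', 'letter_c']
```

Other strategies, such as margin or entropy sampling, rank the pool differently but follow the same pattern of scoring unlabelled items and annotating the most informative ones first.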

    Genetic influences on epilepsy outcomes: a whole‐exome sequencing and healthcare records data linkage study

    Objective: This study was undertaken to develop a novel pathway linking genetic data with routinely collected data for people with epilepsy, and to analyze the influence of rare, deleterious genetic variants on epilepsy outcomes. Methods: We linked whole-exome sequencing (WES) data with routinely collected primary and secondary care data and natural language processing (NLP)-derived seizure frequency information for people with epilepsy within the Secure Anonymised Information Linkage Databank. The study participants were adults who had consented to participate in the Swansea Neurology Biobank, Wales, between 2016 and 2018. DNA sequencing was carried out as part of the Epi25 collaboration. For each individual, we calculated the total number and cumulative burden of rare and predicted deleterious genetic variants and the total of rare and deleterious variants in epilepsy and drug metabolism genes. We compared these measures with the following outcomes: (1) no unscheduled hospital admissions versus unscheduled admissions for epilepsy, (2) antiseizure medication (ASM) monotherapy versus polytherapy, and (3) at least 1 year of seizure freedom versus <1 year of seizure freedom. Results: We linked genetic data for 107 individuals with epilepsy (52% female) to electronic health records. Twenty-six percent had unscheduled hospital admissions, and 70% were prescribed ASM polytherapy. Seizure frequency information was linked for 100 individuals, and 10 were seizure-free. There was no significant difference between the outcome groups in terms of the exome-wide and gene-based burden of rare and deleterious genetic variants. Significance: We successfully uploaded, annotated, and linked genetic sequence data and NLP-derived seizure frequency data to anonymized health care records in this proof-of-concept study. We did not detect a genetic influence on real-world epilepsy outcomes, but our study was limited by a small sample size. 
Future studies will require larger WES datasets to establish genetic variant contribution to epilepsy outcomes

    Epilepsy, antiepileptic drugs, and the risk of major cardiovascular events

    Objective: This study was undertaken to determine whether epilepsy and antiepileptic drugs (including enzyme-inducing and non-enzyme-inducing drugs) are associated with major cardiovascular events using population-level, routinely collected data. Methods: Using anonymized, routinely collected, health care data in Wales, UK, we performed a retrospective matched cohort study (2003–2017) of adults with epilepsy prescribed an antiepileptic drug. Controls were matched with replacement on age, gender, deprivation quintile, and year of entry into the study. Participants were followed to the end of the study for the occurrence of a major cardiovascular event, and survival models were constructed to compare the time to a major cardiovascular event (cardiac arrest, myocardial infarction, stroke, ischemic heart disease, clinically significant arrhythmia, thromboembolism, onset of heart failure, or a cardiovascular death) for individuals in the case group versus the control group. Results: There were 10 241 cases (mean age = 49.6 years, 52.2% male, mean follow-up = 6.1 years) matched to 35 145 controls. A total of 3180 (31.1%) cases received enzyme-inducing antiepileptic drugs, and 7061 (68.9%) received non-enzyme-inducing antiepileptic drugs. Cases had an increased risk of experiencing a major cardiovascular event compared to controls (adjusted hazard ratio = 1.58, 95% confidence interval [CI] = 1.51–1.63, p < .001). There was no notable difference in major cardiovascular events between those treated with enzyme-inducing antiepileptic drugs and those treated with non-enzyme-inducing antiepileptic drugs (adjusted hazard ratio = .95, 95% CI = .86–1.05, p = .300). Significance: Individuals with epilepsy prescribed antiepileptic drugs are at an increased risk of major cardiovascular events compared with population controls. 
Being prescribed an enzyme-inducing antiepileptic drug is not associated with a greater risk of a major cardiovascular event compared to treatment with other antiepileptic drugs. Our data emphasize the importance of cardiovascular risk management in the clinical care of people with epilepsy

    Using natural language processing to extract structured epilepsy data from unstructured clinic letters

    Introduction: Electronic health records (EHR) are a powerful resource in enabling large-scale healthcare research. EHRs often lack detailed disease-specific information that is collected in free text within clinical settings. This challenge can be addressed by using Natural Language Processing (NLP) to derive and extract detailed clinical information from free text. Objectives and Approach: Using a training sample of 40 letters, we used the General Architecture for Text Engineering (GATE) framework to build custom rule sets for nine categories of epilepsy information as well as clinic date and date of birth. We used a validation set of 200 clinic letters to compare the results of our algorithm to a separate manual review by a clinician, where we evaluated a “per item” and a “per letter” approach for each category. Results: The “per item” approach identified 1,939 items of information with overall precision, recall and F1-score of 92.7%, 77.7% and 85.6%. Precision and recall for epilepsy-specific categories were: diagnosis (85.3%, 92.4%), type (93.7%, 83.2%), focal seizure (99.0%, 68.3%), generalised seizure (92.5%, 57.0%), seizure frequency (92.0%, 52.3%), medication (96.1%, 94.0%), CT (66.7%, 47.1%), MRI (96.6%, 51.4%) and EEG (95.8%, 40.6%). By combining all items per category, per letter, we were able to achieve higher precision, recall and F1-scores of 94.6%, 84.2% and 89.0% across all categories. Conclusion/Implications: Our results demonstrate that NLP techniques can be used to accurately extract rich phenotypic details from clinic letters that are often missing from routinely collected data. Capturing these new data types provides a platform for conducting novel precision neurology research, in addition to potential applicability to other disease areas
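The abstract does not define the two scoring approaches in detail. The sketch below shows one plausible reading, stated here as an assumption: "per item" scores every individual mention, while "per letter" first merges repeated mentions of the same category and value within a letter. The tuple layout, letter ids, and example mentions are hypothetical.

```python
from typing import Set, Tuple

# (letter_id, category, value, character_offset) -- hypothetical layout
Mention = Tuple[str, str, str, int]


def precision_recall(predicted: set, gold: set) -> Tuple[float, float]:
    """Precision and recall from exact set overlap."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def per_letter(mentions: Set[Mention]) -> set:
    """Merge repeated mentions: keep one (letter, category, value) per letter."""
    return {(letter, category, value) for letter, category, value, _ in mentions}


gold = {
    ("L1", "medication", "lamotrigine", 10),
    ("L1", "medication", "lamotrigine", 250),  # repeated mention
    ("L1", "EEG", "normal", 120),
    ("L2", "MRI", "normal", 80),
}
predicted = {
    ("L1", "medication", "lamotrigine", 10),
    ("L1", "EEG", "normal", 120),
}

per_item_scores = precision_recall(predicted, gold)
per_letter_scores = precision_recall(per_letter(predicted), per_letter(gold))
print(per_item_scores, per_letter_scores)
```

In this toy example the missed repeat of "lamotrigine" lowers per-item recall but not per-letter recall, which is the kind of effect that could make letter-level scores higher than item-level ones.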

    Obtaining structured clinical data from unstructured data using natural language processing software

    Background: Free text documents in healthcare settings contain a wealth of information not captured in electronic healthcare records (EHRs). Epilepsy clinic letters are an example of an unstructured data source containing a large amount of intricate disease information. Extracting meaningful and contextually correct clinical information from free text sources, to enhance EHRs, remains a significant challenge. SCANR (Swansea University Collaborative in the Analysis of NLP Research) was set up to use natural language processing (NLP) technology to extract structured data from unstructured sources. IBM Watson Content Analytics software (ICA) uses NLP technology. It enables users to define annotations based on dictionaries and language characteristics to create parsing rules that highlight relevant items. These include clinical details such as symptoms and diagnoses, medication and test results, as well as personal identifiers. Approach: To use ICA to build a pipeline to accurately extract detailed epilepsy information from clinic letters. Methods: We used ICA to retrieve important epilepsy information from 41 pseudo-anonymized unstructured epilepsy clinic letters. The 41 letters consisted of 13 ‘new’ and 28 ‘follow-up’ letters (for 15 different patients) written by 12 different doctors in different styles. We designed dictionaries and annotators to enable ICA to extract epilepsy type (focal, generalized or unclassified), epilepsy cause, age of onset, investigation results (EEG, CT and MRI), medication, and clinic date. Epilepsy clinicians assessed the accuracy of the pipeline. Results: The accuracy (sensitivity, specificity) of each concept was: epilepsy diagnosis 98% (97%, 100%), focal epilepsy 100%, generalized epilepsy 98% (93%, 100%), medication 95% (93%, 100%), age of onset 100% and clinic date 95% (95%, 100%). 
Precision and recall for each concept were, respectively, 98% and 97% for epilepsy diagnosis, 100% each for focal epilepsy, 100% and 93% for generalized epilepsy, 100% each for age of onset, 100% and 93% for medication, 100% and 96% for EEG results, 100% and 83% for MRI scan results, and 100% and 95% for clinic date. Conclusions: ICA is capable of extracting detailed, structured epilepsy information from unstructured clinic letters to a high degree of accuracy. These data can be used to populate relational databases and be linked to EHRs. Researchers can build in custom rules to identify concepts of interest from letters and produce structured information. We plan to extend our work to hundreds and then thousands of clinic letters, to provide phenotypically rich epilepsy data to link with other anonymised, routinely collected data
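The abstract reports accuracy, sensitivity, specificity, precision, and recall side by side. The sketch below shows how all of them derive from the same four confusion-matrix counts. The counts are hypothetical and chosen only for illustration, since the abstract does not report the underlying numbers; the function name is also illustrative, and it assumes every denominator is non-zero.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics from confusion-matrix counts (non-zero denominators assumed)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts, not taken from the study.
metrics = diagnostic_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Note that sensitivity and recall are the same quantity, so a concept's recall should match its sensitivity whenever both are computed from the same counts.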

    Validating epilepsy diagnoses in routinely collected data

    Introduction: Primary healthcare records are used for studies within large data repositories. One of the limitations of using these routinely collected data for epilepsy research is the possibility of including incorrectly recorded diagnoses. To our knowledge, the accuracy of UK GP diagnosis codes for epilepsy has only partially been validated. Objectives and Approach: We aimed to validate the accuracy of case ascertainment algorithms in identifying people with epilepsy in routinely collected Welsh healthcare data. A reference population of 150 people with definite epilepsy and 150 people without epilepsy was ascertained from hospital records and linked to records held within the Secure Anonymised Information Linkage (SAIL) databank in Wales. We used three different algorithms to identify the reference population: a) individuals with an epilepsy diagnosis code and two consecutive AED prescription codes; b) individuals with an epilepsy diagnosis code only; c) individuals with two consecutive AED prescription codes only. Results: We applied the algorithms to all patients and to adults and children separately. For all patients, combining diagnosis and AED prescription codes had a sensitivity of 84% (95% CI 77–90) and specificity of 98% (95–100) in identifying people with epilepsy; diagnosis codes alone had a sensitivity of 86% (80–91) and a specificity of 97% (92–99); and AED prescription codes alone achieved a sensitivity of 92% (70–83) and a specificity of 73% (65–80). Using AED codes only was more accurate in children, achieving a sensitivity of 88% (75–95) and specificity of 98% (88–100). This can be explained by the widespread use of AEDs for indications other than epilepsy in adults, which is not the case for children. Conclusion/Implications: GP epilepsy diagnosis and AED prescription codes can be used to identify people with epilepsy using anonymised healthcare records in Wales. 
In children, using AED prescription codes alone is an accurate way to identify epilepsy cases. These results are generalizable to other studies that use UK primary care records
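The three case ascertainment algorithms described above can be sketched as simple rules scored against a reference population. This is an illustrative reading, not the study's code: the `GPRecord` fields, the pre-computed prescription count, and the six example records are all hypothetical, and algorithms b) and c) are interpreted as applying a single criterion on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GPRecord:
    has_epilepsy_diagnosis_code: bool
    consecutive_aed_prescriptions: int  # hypothetical pre-computed count
    has_epilepsy: bool                  # reference-standard label


def algorithm_a(r: GPRecord) -> bool:
    """Diagnosis code and two consecutive AED prescription codes."""
    return r.has_epilepsy_diagnosis_code and r.consecutive_aed_prescriptions >= 2


def algorithm_b(r: GPRecord) -> bool:
    """Diagnosis code alone."""
    return r.has_epilepsy_diagnosis_code


def algorithm_c(r: GPRecord) -> bool:
    """Two consecutive AED prescription codes alone."""
    return r.consecutive_aed_prescriptions >= 2


def evaluate(rule: Callable[[GPRecord], bool], records: List[GPRecord]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of a rule against the reference labels."""
    tp = sum(1 for r in records if r.has_epilepsy and rule(r))
    fn = sum(1 for r in records if r.has_epilepsy and not rule(r))
    tn = sum(1 for r in records if not r.has_epilepsy and not rule(r))
    fp = sum(1 for r in records if not r.has_epilepsy and rule(r))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical reference population of six people.
records = [
    GPRecord(True, 3, True),
    GPRecord(True, 0, True),
    GPRecord(False, 2, True),
    GPRecord(False, 2, False),  # AED prescribed for another indication
    GPRecord(False, 0, False),
    GPRecord(True, 0, False),   # incorrectly recorded diagnosis
]
results = {rule.__name__: evaluate(rule, records)
           for rule in (algorithm_a, algorithm_b, algorithm_c)}
print(results)
```

In this toy data the combined rule is the most specific and the least sensitive, which mirrors the general trade-off between requiring more evidence and missing true cases.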