Predicting early psychiatric readmission with natural language processing of narrative discharge summaries
The ability to predict psychiatric readmission would facilitate the development of interventions to reduce this risk, a major driver of psychiatric health-care costs. The symptoms or characteristics of illness course necessary to develop reliable predictors are not available in coded billing data, but may be present in narrative electronic health record (EHR) discharge summaries. We identified a cohort of individuals admitted to a psychiatric inpatient unit between 1994 and 2012 with a principal diagnosis of major depressive disorder, and extracted inpatient psychiatric discharge narrative notes. Using these data, we trained a 75-topic Latent Dirichlet Allocation (LDA) model, a form of natural language processing, which identifies groups of words associated with topics discussed in a document collection. The cohort was randomly split to derive a training (70%) and testing (30%) data set, and we trained separate support vector machine models for baseline clinical features alone, baseline features plus common individual words and the above plus topics identified from the 75-topic LDA model. Of 4687 patients with inpatient discharge summaries, 470 were readmitted within 30 days. The 75-topic LDA model included topics linked to psychiatric symptoms (suicide, severe depression, anxiety, trauma, eating/weight and panic) and major depressive disorder comorbidities (infection, postpartum, brain tumor, diarrhea and pulmonary disease). By including LDA topics, prediction of readmission, as measured by area under receiver-operating characteristic curves in the testing data set, was improved from baseline (area under the curve 0.618) to baseline+1000 words (0.682) to baseline+75 topics (0.784). Inclusion of topics derived from narrative notes allows more accurate discrimination of individuals at high risk for psychiatric readmission in this cohort. 
Topic modeling and related approaches offer the potential to improve prediction using EHRs, if generalizability can be established in other clinical cohorts.
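The topic-plus-baseline feature pipeline this abstract describes can be sketched with scikit-learn. Everything below is illustrative: the toy notes, the synthetic "age" baseline feature and the three-topic model are invented stand-ins for the study's 75-topic LDA and its clinical covariates.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

# Toy discharge-note corpus; the study fit 75 topics on real summaries.
notes = [
    "severe depression with suicidal ideation and poor sleep",
    "postpartum depression anxiety and panic attacks",
    "pulmonary disease infection treated during admission",
    "weight loss eating disorder and anxiety symptoms",
] * 5
readmitted = np.array([1, 1, 0, 0] * 5)

# 1) Bag-of-words counts, then LDA topic proportions as dense features.
counts = CountVectorizer().fit_transform(notes)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row sums to ~1

# 2) Concatenate with baseline clinical features (here: an invented age column).
baseline = np.linspace(20, 60, len(notes)).reshape(-1, 1)
features = np.hstack([baseline, doc_topics])

# 3) Linear SVM on the combined representation, as in the study's comparison.
clf = SVC(kernel="linear").fit(features, readmitted)
```

In the study, the same split-and-compare design (baseline alone, plus words, plus topics) is what isolates the contribution of the LDA features.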
Modeling Disease Severity in Multiple Sclerosis Using Electronic Health Records
Objective:
To optimally leverage the scalability and unique features of the electronic health records (EHR) for research that would ultimately improve patient care, we need to accurately identify patients and extract clinically meaningful measures. Using multiple sclerosis (MS) as a proof of principle, we showcased how to leverage routinely collected EHR data to identify patients with a complex neurological disorder and derive an important surrogate measure of disease severity heretofore only available in research settings.
Methods:
In a cross-sectional observational study, 5,495 MS patients were identified from the EHR systems of two major referral hospitals using an algorithm that combines codified information with narrative information extracted using natural language processing. In the subset of patients who received neurological care at an MS Center where disease measures have been collected, we used routinely collected EHR data to derive two clinically relevant aggregate indicators of MS severity: the multiple sclerosis severity score (MSSS) and the brain parenchymal fraction (BPF, a measure of whole brain volume).
Results:
The EHR algorithm that identifies MS patients has an area under the curve of 0.958, 83% sensitivity, 92% positive predictive value, and 89% negative predictive value when a 95% specificity threshold is used. The correlation between EHR-derived and true MSSS has a mean R² = 0.38±0.05, and that between EHR-derived and true BPF has a mean R² = 0.22±0.08. To illustrate its clinical relevance, derived MSSS captures the expected difference in disease severity between relapsing-remitting and progressive MS patients after adjusting for sex, age of symptom onset and disease duration (p = 1.56×10⁻¹²).
Conclusion:
Incorporation of sophisticated codified and narrative EHR data accurately identifies MS patients and provides an estimate of a well-accepted indicator of MS severity that is widely used in research settings but not part of the routine medical record. Similar approaches could be applied to other complex neurological disorders.
National Institute of General Medical Sciences (U.S.) (NIH U54-LM008748)
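Reporting sensitivity, PPV and NPV at a fixed specificity threshold, as the MS-algorithm evaluation above does, can be computed directly from labels and classifier scores. A minimal NumPy sketch; the helper name and the toy data in the usage example are invented:

```python
import numpy as np

def metrics_at_specificity(y_true, scores, min_specificity=0.95):
    """Pick the lowest score threshold whose specificity meets the target,
    then report sensitivity, PPV and NPV at that operating point."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    for thr in np.unique(scores):          # thresholds in ascending order
        pred = scores >= thr
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        tn = np.sum(~pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        spec = tn / (tn + fp)              # tn + fp = total negatives
        if spec >= min_specificity:
            return {
                "threshold": float(thr),
                "sensitivity": tp / (tp + fn),
                "ppv": tp / (tp + fp) if (tp + fp) else 0.0,
                "npv": tn / (tn + fn) if (tn + fn) else 0.0,
            }
    return None  # no threshold reaches the requested specificity

# Invented example: 20 controls, 5 cases, mostly well-separated scores.
y = [0] * 20 + [1] * 5
s = [0.1] * 19 + [0.9] + [0.8] * 4 + [0.2]
m = metrics_at_specificity(y, s)
```

At the first threshold meeting 95% specificity (0.2), this toy example yields sensitivity 1.0, PPV 5/6 and NPV 1.0.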
DeepCare: A Deep Dynamic Memory Model for Predictive Medicine
Personalized predictive medicine necessitates the modeling of patient illness
and care processes, which inherently have long-term temporal dependencies.
Healthcare observations, recorded in electronic medical records, are episodic
and irregular in time. We introduce DeepCare, an end-to-end deep dynamic neural
network that reads medical records, stores previous illness history, infers
current illness states and predicts future medical outcomes. At the data level,
DeepCare represents care episodes as vectors in space, models patient health
state trajectories through explicit memory of historical records. Built on Long
Short-Term Memory (LSTM), DeepCare introduces time parameterizations to handle
irregular timed events by moderating the forgetting and consolidation of memory
cells. DeepCare also incorporates medical interventions that change the course
of illness and shape future medical risk. Moving up to the health state level,
historical and present health states are then aggregated through multiscale
temporal pooling, before passing through a neural network that estimates future
outcomes. We demonstrate the efficacy of DeepCare for disease progression
modeling, intervention recommendation, and future risk prediction. On two
important cohorts with heavy social and economic burden -- diabetes and mental
health -- the results show improved modeling and risk prediction accuracy.
Comment: Accepted at JBI under the new name: "Predicting healthcare trajectories from medical records: A deep learning approach".
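DeepCare's central mechanism, moderating an LSTM's forgetting by the elapsed time between care episodes, can be illustrated with a single scalar memory cell. The decay function and gate weights below are invented for illustration and are not DeepCare's exact parameterization:

```python
import numpy as np

def decay(dt):
    """Monotone decay of memory with the elapsed time between visits;
    one illustrative parameterization, not DeepCare's exact form."""
    return 1.0 / np.log(np.e + dt)

def step(c_prev, x, dt, Wf=0.8, Wi=0.5):
    """One LSTM-like memory update where the forget gate is additionally
    moderated by the gap `dt` (e.g. days) since the previous episode."""
    f = decay(dt) * Wf            # time-moderated forgetting
    i = Wi                        # input gate (held fixed for the sketch)
    return f * c_prev + i * np.tanh(x)

# With no new input, an old memory survives a short gap far better
# than a long one -- the property DeepCare exploits for irregular visits.
c_near = step(c_prev=1.0, x=0.0, dt=1.0)    # visit one day later
c_far = step(c_prev=1.0, x=0.0, dt=365.0)   # visit one year later
```

The real model applies this per memory-cell dimension inside a full LSTM, alongside intervention inputs and multiscale pooling of the resulting health states.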
Desiderata for the development of next-generation electronic health record phenotype libraries
Background
High-quality phenotype definitions are desirable to enable the extraction of patient cohorts from large electronic health record repositories and are characterized by properties such as portability, reproducibility, and validity. Phenotype libraries, where definitions are stored, have the potential to contribute significantly to the quality of the definitions they host. In this work, we present a set of desiderata for the design of a next-generation phenotype library that is able to ensure the quality of hosted definitions by combining the functionality currently offered by disparate tooling.
Methods
A group of researchers examined work to date on phenotype models, implementation, and validation, as well as contemporary phenotype libraries developed as a part of their own phenomics communities. Existing phenotype frameworks were also examined. This work was translated and refined by all the authors into a set of best practices.
Results
We present 14 library desiderata that promote high-quality phenotype definitions, in the areas of modelling, logging, validation, and sharing and warehousing.
Conclusions
There are a number of choices to be made when constructing phenotype libraries. Our considerations distil the best practices in the field and include pointers towards their further development to support portable, reproducible, and clinically valid phenotype design. The provision of high-quality phenotype definitions enables electronic health record data to be more effectively used in medical domains.
Predicting high-cost care in a mental health setting
Background:
The density of information in digital health records offers new potential opportunities for automated prediction of cost-relevant outcomes.
Aims:
We investigated the extent to which routinely recorded data held in the electronic health record (EHR) predict priority service outcomes and whether natural language processing tools enhance the predictions. We evaluated three high priority outcomes: in-patient duration, readmission following in-patient care and high service cost after first presentation.
Method:
We used data obtained from a clinical database derived from the EHR of a large mental healthcare provider within the UK. We combined structured data with text-derived data relating to diagnosis statements, medication and psychiatric symptomatology. Predictors of the three different clinical outcomes were modelled using logistic regression with performance evaluated against a validation set to derive areas under receiver operating characteristic curves.
Results:
In validation samples, the full models (using all available data) achieved areas under receiver operating characteristic curves between 0.59 and 0.85 (in-patient duration 0.63, readmission 0.59, high service use 0.85). Adding natural language processing-derived data to the models increased the variance explained across all clinical scenarios (observed increase in r² = 12–46%).
Conclusions:
EHR data offer the potential to improve routine clinical predictions by utilising previously inaccessible data. Of our scenarios, prediction of high service use after initial presentation achieved the highest performance.
Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project.
OBJECTIVES: We sought to use natural language processing to develop a suite of language models to capture key symptoms of severe mental illness (SMI) from clinical text, to facilitate the secondary use of mental healthcare data in research. DESIGN: Development and validation of information extraction applications for ascertaining symptoms of SMI in routine mental health records using the Clinical Record Interactive Search (CRIS) data resource; description of their distribution in a corpus of discharge summaries. SETTING: Electronic records from a large mental healthcare provider serving a geographic catchment of 1.2 million residents in four boroughs of south London, UK. PARTICIPANTS: The distribution of derived symptoms was described in 23 128 discharge summaries from 7962 patients who had received an SMI diagnosis, and 13 496 discharge summaries from 7575 patients who had received a non-SMI diagnosis. OUTCOME MEASURES: Fifty SMI symptoms were identified by a team of psychiatrists for extraction based on salience and linguistic consistency in records, broadly categorised under positive, negative, disorganisation, manic and catatonic subgroups. Text models for each symptom were generated using the TextHunter tool and the CRIS database. RESULTS: We extracted data for 46 symptoms with a median F1 score of 0.88. Four symptom models performed poorly and were excluded. From the corpus of discharge summaries, it was possible to extract symptomatology in 87% of patients with SMI and 60% of patients with non-SMI diagnosis. CONCLUSIONS: This work demonstrates the possibility of automatically extracting a broad range of SMI symptoms from English text discharge summaries for patients with an SMI diagnosis. Descriptive data also indicated that most symptoms cut across diagnoses, rather than being restricted to particular groups.
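A rule-based symptom extractor of the kind this abstract describes can be sketched with a small regex lexicon and an F1 evaluation against gold annotations. The patterns below are invented toy stand-ins for the ~50 TextHunter-built models; the example also shows why such systems need negation handling:

```python
import re

# Hypothetical mini-lexicon; the CRIS-CODE project built ~50 per-symptom
# models with TextHunter -- these three patterns are illustrative only.
SYMPTOM_PATTERNS = {
    "hallucination": re.compile(r"\bhallucinat\w*", re.IGNORECASE),
    "insomnia": re.compile(r"\b(insomnia|poor sleep)\b", re.IGNORECASE),
    "low_mood": re.compile(r"\blow mood\b", re.IGNORECASE),
}

def extract_symptoms(note):
    """Return the set of symptom labels whose pattern fires in the note."""
    return {name for name, pat in SYMPTOM_PATTERNS.items() if pat.search(note)}

def f1(predicted, gold):
    """F1 over symptom label sets for one document."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

note = "Patient reports low mood and poor sleep; denies hallucinations."
found = extract_symptoms(note)
gold = {"insomnia", "low_mood"}
score = f1(found, gold)
```

Note the false positive: "denies hallucinations" still fires the `hallucination` pattern, which is why production systems layer negation detection on top of pattern matching.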
Automatic text processing of clinical narratives (Processamento automático de texto de narrativas clínicas)
The informatization of medical systems and the subsequent move towards
the usage of Electronic Health Records (EHR) over the paper format by
medical professionals allowed for safer and more efficient healthcare. Additionally,
EHR can also be used as a data source for observational studies
around the world. However, it is estimated that 70-80% of all clinical data
is in the form of unstructured free text and regarding the data that is structured,
not all of it follows the same standards, making it difficult to use in
the aforementioned observational studies.
This dissertation aims to tackle those two adversities using natural language
processing for the task of extracting concepts from free text and, afterwards,
use a common data model to harmonize the data. The developed system
employs an annotator, namely cTAKES, to extract the concepts from free
text. The extracted concepts are then normalized using text preprocessing,
word embeddings, MetaMap and UMLS Metathesaurus lookup. Finally, the
normalized concepts are converted to the OMOP Common Data Model and
stored in a database.
In order to test the developed system, the i2b2 2010 data set was used.
The different components of the system were tested and evaluated separately,
with the concept extraction component achieving a precision, recall
and F-score of 77.12%, 70.29% and 73.55%, respectively. The normalization
component was evaluated by completing the N2C2 2019 challenge
track 3, where it achieved a 77.5% accuracy. Finally, during the OMOP
CDM conversion component, it was observed that 7.92% of the concepts
were lost during the process. In conclusion, even though the developed system
still has margin for improvements, it proves to be a viable method of
automatically processing clinical narratives.
MSc in Computer and Telematics Engineering (Mestrado em Engenharia de Computadores e Telemática)
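The dissertation's pipeline, extracting concept mentions, normalizing them to UMLS concepts, then mapping to the OMOP Common Data Model, can be sketched with dictionary lookups. The vocabularies below are tiny invented stand-ins for UMLS and the OMOP concept table; the real system uses cTAKES, MetaMap and word embeddings for the first two stages:

```python
# Hypothetical mini-vocabularies standing in for UMLS and OMOP.
UMLS_LOOKUP = {                 # surface form -> UMLS CUI
    "myocardial infarction": "C0027051",
    "heart attack": "C0027051",
    "hypertension": "C0020538",
    "type 2 diabetes": "C0011860",
}
CUI_TO_OMOP = {                 # CUI -> OMOP standard concept id
    "C0027051": 4329847,
    "C0020538": 320128,
    # C0011860 deliberately unmapped: concepts can be lost at this
    # stage, mirroring the 7.92% loss reported in the dissertation.
}

def normalize(mention):
    """Normalize a free-text mention to a CUI via dictionary lookup."""
    return UMLS_LOOKUP.get(mention.lower().strip())

def to_omop(mentions):
    """Map normalized mentions to OMOP ids; report the lost fraction."""
    cuis = [c for m in mentions if (c := normalize(m)) is not None]
    mapped = [CUI_TO_OMOP[c] for c in cuis if c in CUI_TO_OMOP]
    lost = 1 - len(mapped) / len(cuis) if cuis else 0.0
    return mapped, lost

mentions = ["Heart attack", "hypertension", "type 2 diabetes"]
omop_ids, lost_fraction = to_omop(mentions)
```

The mapped ids would then be stored in OMOP CDM tables; tracking `lost_fraction` per batch is how a pipeline like this surfaces the conversion loss the dissertation measured.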
The Secure Anonymised Information Linkage databank Dementia e-cohort (SAIL-DeC)
Introduction:
The rising burden of dementia is a global concern, and there is a need to study its causes, natural history and outcomes. The Secure Anonymised Information Linkage (SAIL) Databank contains anonymised, routinely-collected healthcare data for the population of Wales, UK. It has potential to be a valuable resource for dementia research owing to its size, long follow-up time and prospective collection of data during clinical care.
Objectives:
We aimed to apply reproducible methods to create the SAIL dementia e-cohort (SAIL-DeC). We created SAIL-DeC with a view to maximising its utility for a broad range of research questions whilst minimising duplication of effort for researchers.
Methods:
SAIL contains individual-level, linked primary care, hospital admission, mortality and demographic data. Data are currently available until 2018 and future updates will extend participant follow-up time. We included participants who were born between 1st January 1900 and 1st January 1958 and for whom primary care data were available. We applied algorithms consisting of International Classification of Diseases (versions 9 and 10) and Read (version 2) codes to identify participants with and without all-cause dementia and dementia subtypes. We also created derived variables for comorbidities and risk factors.
Results:
From 4.4 million unique participants in SAIL, 1.2 million met the cohort inclusion criteria, resulting in 18.8 million person-years of follow-up. Of these, 129,650 (10%) developed all-cause dementia, with 77,978 (60%) having dementia subtype codes. Alzheimer's disease was the most common subtype diagnosis (62%). Among the dementia cases, the median duration of observation time was 14 years.
Conclusion:
We have created a generalisable, national dementia e-cohort, aimed at facilitating epidemiological dementia research.
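The code-list approach to cohort derivation can be sketched as follows. The ICD-10 codes and subtype mapping below are a small illustrative subset, not SAIL-DeC's actual curated ICD-9/10 and Read v2 lists:

```python
# Illustrative code lists only -- a real phenotyping algorithm uses
# full curated lists across ICD-9, ICD-10 and Read v2.
DEMENTIA_ICD10 = {"F00", "F01", "F03", "G30"}
SUBTYPE = {
    "F00": "Alzheimer's disease",
    "G30": "Alzheimer's disease",
    "F01": "vascular dementia",
}

def classify(patient_codes):
    """Flag all-cause dementia from 3-character ICD-10 prefixes and,
    where the codes allow, assign a single unambiguous subtype."""
    hits = {c[:3] for c in patient_codes} & DEMENTIA_ICD10
    if not hits:
        return {"dementia": False, "subtype": None}
    subtypes = {SUBTYPE[c] for c in hits if c in SUBTYPE}
    return {"dementia": True,
            "subtype": subtypes.pop() if len(subtypes) == 1 else None}

# Invented mini-cohort of coded patient records.
cohort = {
    "p1": ["I10", "G30.9"],   # hypertension + Alzheimer's disease
    "p2": ["E11.9"],          # diabetes, no dementia code
    "p3": ["F03"],            # unspecified dementia, no subtype code
}
flags = {pid: classify(codes) for pid, codes in cohort.items()}
```

Run over all eligible participants, counting the `dementia` and `subtype` flags reproduces the kind of cohort summary reported in the Results above.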