Automatic Prediction of Recurrence of Major Cardiovascular Events: A Text Mining Study Using Chest X-Ray Reports
Background and Objective. Electronic health records (EHRs) contain free-text information on symptoms, diagnosis, treatment, and prognosis of diseases. However, this potential goldmine of health information cannot be easily accessed and used unless proper text mining techniques are applied. The aim of this project was to develop and evaluate a text mining pipeline in a multimodal learning architecture to demonstrate the value of medical text classification in chest radiograph reports for cardiovascular risk prediction. We sought to assess the integration of various text representation approaches and clinical structured data with state-of-the-art deep learning methods in the process of medical text mining. Methods. We used EHR data of patients included in the Second Manifestations of ARTerial disease (SMART) study. We propose a deep learning-based multimodal architecture for our text mining pipeline that integrates neural text representation with preprocessed clinical predictors for the prediction of recurrence of major cardiovascular events in cardiovascular patients. Text preprocessing, including cleaning and stemming, was first applied to filter out the unwanted texts from X-ray radiology reports. Thereafter, text representation methods were used to numerically represent unstructured radiology reports with vectors. Subsequently, these text representation methods were added to prediction models to assess their clinical relevance. In this step, we applied logistic regression, support vector machine (SVM), multilayer perceptron neural network, convolutional neural network, long short-term memory (LSTM), and bidirectional LSTM deep neural network (BiLSTM). Results. We performed various experiments to evaluate the added value of the text in the prediction of major cardiovascular events. The two main scenarios were the integration of radiology reports (1) with classical clinical predictors and (2) with only age and sex in the case of unavailable clinical predictors. 
In total, data of 5603 patients were used with 5-fold cross-validation to train the models. In the first scenario, the multimodal BiLSTM (MI-BiLSTM) model achieved an area under the curve (AUC) of 84.7%, a misclassification rate of 14.3%, and an F1 score of 83.8%. In this scenario, the SVM model, trained on clinical variables and a bag-of-words representation, achieved the lowest misclassification rate of 12.2%. In the case of unavailable clinical predictors, the MI-BiLSTM model trained on radiology reports and demographic (age and sex) variables reached an AUC, F1 score, and misclassification rate of 74.5%, 70.8%, and 20.4%, respectively. Conclusions. Using the case study of routine care chest X-ray radiology reports, we demonstrated the clinical relevance of integrating text features and classical predictors in our text mining pipeline for cardiovascular risk prediction. The MI-BiLSTM model with word embedding representation appeared to have a desirable performance when trained on text data integrated with the clinical variables from the SMART study. Our results mined from chest X-ray reports showed that models using text data in addition to laboratory values outperform those using only known clinical predictors.
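As a rough illustration of the three metrics reported above (AUC, misclassification rate, and F1 score), the following minimal sketch computes them for binary labels in plain Python. The function names and the toy labels and scores are assumptions made for this example; they are not taken from the study, which does not describe its evaluation code.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = recurrent event, 0 = no event.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

print(misclassification_rate(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(auc(y_true, scores))
```

In a cross-validated setting such as the 5-fold scheme mentioned in the abstract, these quantities would typically be computed on each held-out fold and then averaged.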
Automatic ICD-10 classification of diseases from Dutch discharge letters
The international classification of diseases (ICD) is a widely used tool to describe patient diagnoses. At University Medical Center Utrecht (UMCU), for example, trained medical coders translate information from hospital discharge letters into ICD-10 codes for research and national disease epidemiology statistics, at considerable cost. To mitigate these costs, automatic ICD coding from discharge letters would be useful. However, this task has proven challenging in practice: it is a multi-label task with a large number of very sparse categories, presented in a hierarchical structure. Moreover, existing ICD systems have been benchmarked only on relatively easier versions of this task, such as single-label performance and performance on the higher “chapter” level of the ICD hierarchy, which contains fewer categories. In this study, we benchmark the state-of-the-art ICD classification systems and two baseline systems on a large dataset constructed from Dutch cardiology discharge letters at UMCU hospital. Performance of all systems is evaluated for both the easier chapter-level ICD codes and single-label version of the task found in the literature, as well as for the lower-level ICD hierarchy and multi-label task that is needed in practice. We find that state-of-the-art methods outperform the baseline for the single-label version of the task only. For the multi-label task, the baselines are not defeated by any state-of-the-art system, with the exception of HA-GRU, which does perform best in the most difficult task on accuracy. We conclude that practical performance may have been somewhat overstated in the literature, although deep learning techniques are sufficiently good to complement, though not replace, human ICD coding in our application.
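To make the difference between evaluation levels concrete, here is a small sketch of multi-label scoring at two granularities of the ICD-10 hierarchy: full codes and three-character categories (the part before the dot). The helper names, the choice of exact-match accuracy and micro-averaged F1, and the example code sets are all assumptions for illustration; the study's own metrics and data are not reproduced here, and chapter-level grouping is left out because chapters do not map cleanly onto code prefixes.

```python
def to_category(code):
    """Truncate an ICD-10 code such as 'I21.0' to its category 'I21'."""
    return code.split(".")[0]


def coarsen(label_sets):
    """Map every code in every label set to its three-character category."""
    return [{to_category(c) for c in labels} for labels in label_sets]


def exact_match_accuracy(true_sets, pred_sets):
    """Share of documents whose predicted label set is exactly right."""
    hits = sum(t == p for t, p in zip(true_sets, pred_sets))
    return hits / len(true_sets)


def micro_f1(true_sets, pred_sets):
    """F1 over all (document, label) pairs pooled together."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical gold and predicted code sets for three discharge letters.
true_codes = [{"I21.0", "I10"}, {"I50.1"}, {"I48", "E11.9"}]
pred_codes = [{"I21.4", "I10"}, {"I50.1"}, {"I48"}]

full_acc = exact_match_accuracy(true_codes, pred_codes)
full_f1 = micro_f1(true_codes, pred_codes)

cat_acc = exact_match_accuracy(coarsen(true_codes), coarsen(pred_codes))
cat_f1 = micro_f1(coarsen(true_codes), coarsen(pred_codes))

print(full_acc, full_f1)
print(cat_acc, cat_f1)
```

In this toy case the same predictions score higher at the category level than at the full-code level, which is the kind of gap between easier and harder task versions that the abstract points to.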
Creating HIV risk profiles for men in South Africa: A latent class approach using cross-sectional survey data
Introduction: Engaging at-risk men in HIV prevention programs and services is a current priority, yet there are few effective ways to identify which men are at highest risk or how to best reach them. In this study we generated multi-factor profiles of HIV acquisition/transmission risk for men in Durban, South Africa, to help inform targeted programming and service delivery. Methods: Data come from surveys with 947 men ages 20 to 40 conducted in two informal settlements from May to September 2017. Using latent class analysis (LCA), which detects a small set of underlying groups based on multiple dimensions, we identified classes based on nine HIV risk factors and socio-demographic characteristics. We then compared HIV service use between the classes. Results: We identified four latent classes, with good model fit statistics. The older high-risk class (20% of the sample; mean age 36) were more likely married/cohabiting and employed, with multiple sexual partners, substantial age disparity with partners (eight years younger on average), transactional relationships (including more resource-intensive forms like paying for partner's rent), and hazardous drinking. The younger high-risk class (24%; mean age 27) were likely unmarried and employed, with the highest probability of multiple partners in the last year (including 42% with 5+ partners), transactional relationships (less resource-intensive, e.g., clothes/transportation), hazardous drinking, and inequitable gender views. The younger moderate-risk class (36%; mean age 23) were most likely unmarried, unemployed technical college/university students/graduates. They had a relatively high probability of multiple partners and transactional relationships (less resource-intensive), and moderate hazardous drinking. Finally, the older low-risk class (20%; mean age 29) were more likely married/cohabiting, employed, and highly gender-equitable, with few partners and limited transactional relationships. 
Circumcision (status) was higher among the younger moderate-risk class than either high-risk class (p < 0.001). HIV testing and treatment literacy score were suboptimal and did not differ across classes. Conclusions: Distinct HIV risk profiles among men were identified. Interventions should focus on reaching the highest-risk profiles who, despite their elevated risk, were less or no more likely than the lower-risk to use HIV services. By enabling a more synergistic understanding of subgroups, LCA has potential to enable more strategic, data-driven programming and evaluation.
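The latent class analysis described above assigns each respondent a probability of belonging to each class. The sketch below shows only that assignment step, for binary indicators and already-known class parameters, using Bayes' rule. A full LCA would also estimate the class sizes and item probabilities, typically with an EM algorithm, which is not shown. The two classes, three indicators, and all numbers here are invented for the example and are not the study's estimates.

```python
import math


def class_posteriors(responses, class_weights, item_probs):
    """Posterior probability of each latent class for one respondent.

    responses:     list of 0/1 answers, one per indicator
    class_weights: prior share of each class (sums to 1)
    item_probs:    item_probs[c][j] = P(indicator j == 1 | class c),
                   each strictly between 0 and 1
    Indicators are assumed independent within a class, the usual
    local-independence assumption of LCA.
    """
    log_joint = []
    for weight, probs in zip(class_weights, item_probs):
        total = math.log(weight)
        for x, p in zip(responses, probs):
            total += math.log(p if x == 1 else 1.0 - p)
        log_joint.append(total)

    # Normalise in a numerically stable way.
    peak = max(log_joint)
    unnorm = [math.exp(v - peak) for v in log_joint]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]


# Hypothetical parameters: two classes, three yes/no risk indicators
# (e.g. multiple partners, hazardous drinking, transactional relationship).
class_weights = [0.4, 0.6]
item_probs = [
    [0.8, 0.7, 0.6],  # made-up "higher-risk" class
    [0.2, 0.3, 0.1],  # made-up "lower-risk" class
]

posterior = class_posteriors([1, 1, 0], class_weights, item_probs)
print(posterior)
```

Respondents are then commonly assigned to the class with the highest posterior probability, and class-level summaries like those in the abstract are computed from these assignments.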
Nation Binding: How Public Service Broadcasting Mitigates Political Selective Exposure
Recent research suggests that more and more citizens select news and information that is congruent with their existing political preferences. This increase in political selective exposure (PSE) has allegedly led to an increase in polarization. The vast majority of studies stem from the US case with a particular media and political system. We contend that there are good reasons to believe PSE is less prevalent in other systems. We test this using latent profile analysis with national survey data from the Netherlands (n = 2,833). We identify four types of media use profiles and indeed only find partial evidence of PSE. In particular, we find that public broadcasting news cross-cuts all cleavages. This research note offers an important antidote in what is considered a universal phenomenon. We do find, however, a relatively large segment of citizens opting out of news consumption despite the readily available news in today's media landscape.