19 research outputs found

    Automatic Prediction of Recurrence of Major Cardiovascular Events: A Text Mining Study Using Chest X-Ray Reports

    Background and Objective. Electronic health records (EHRs) contain free-text information on symptoms, diagnosis, treatment, and prognosis of diseases. However, this potential goldmine of health information cannot be easily accessed and used unless proper text mining techniques are applied. The aim of this project was to develop and evaluate a text mining pipeline in a multimodal learning architecture to demonstrate the value of medical text classification in chest radiograph reports for cardiovascular risk prediction. We sought to assess the integration of various text representation approaches and clinical structured data with state-of-the-art deep learning methods in the process of medical text mining. Methods. We used EHR data of patients included in the Second Manifestations of ARTerial disease (SMART) study. We propose a deep learning-based multimodal architecture for our text mining pipeline that integrates neural text representation with preprocessed clinical predictors for the prediction of recurrence of major cardiovascular events in cardiovascular patients. Text preprocessing, including cleaning and stemming, was first applied to filter out the unwanted texts from X-ray radiology reports. Thereafter, text representation methods were used to numerically represent unstructured radiology reports with vectors. Subsequently, these text representation methods were added to prediction models to assess their clinical relevance. In this step, we applied logistic regression, support vector machine (SVM), multilayer perceptron neural network, convolutional neural network, long short-term memory (LSTM), and bidirectional LSTM deep neural network (BiLSTM). Results. We performed various experiments to evaluate the added value of the text in the prediction of major cardiovascular events. The two main scenarios were the integration of radiology reports (1) with classical clinical predictors and (2) with only age and sex in the case of unavailable clinical predictors. 
In total, data of 5603 patients were used with 5-fold cross-validation to train the models. In the first scenario, the multimodal BiLSTM (MI-BiLSTM) model achieved an area under the curve (AUC) of 84.7%, misclassification rate of 14.3%, and F1 score of 83.8%. In this scenario, the SVM model, trained on clinical variables and bag-of-words representation, achieved the lowest misclassification rate of 12.2%. In the case of unavailable clinical predictors, the MI-BiLSTM model trained on radiology reports and demographic (age and sex) variables reached an AUC, F1 score, and misclassification rate of 74.5%, 70.8%, and 20.4%, respectively. Conclusions. Using the case study of routine care chest X-ray radiology reports, we demonstrated the clinical relevance of integrating text features and classical predictors in our text mining pipeline for cardiovascular risk prediction. The MI-BiLSTM model with word embedding representation appeared to have a desirable performance when trained on text data integrated with the clinical variables from the SMART study. Our results mined from chest X-ray reports showed that models using text data in addition to laboratory values outperform those using only known clinical predictors
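As a rough illustration of the pipeline described in this abstract, the sketch below builds a bag-of-words vector from a report, concatenates it with structured predictors (age and sex), and computes the misclassification rate and F1 score. It is not the authors' implementation: the function names, the toy reports, and the toy labels are all hypothetical, and no model is trained here.

```python
from collections import Counter


def build_vocabulary(reports):
    """Map each distinct lowercase token to a column index (sorted order)."""
    tokens = sorted({tok for report in reports for tok in report.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(report, vocabulary):
    """Count vector for one report; tokens outside the vocabulary are ignored."""
    counts = Counter(report.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


def multimodal_features(report, structured, vocabulary):
    """Concatenate text counts with structured predictors (here: age, sex)."""
    return bag_of_words(report, vocabulary) + list(structured)


def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1) from binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up reports and predictors, for illustration only.
reports = ["no cardiomegaly", "cardiomegaly present"]
vocabulary = build_vocabulary(reports)
features = multimodal_features("cardiomegaly present", (63, 1), vocabulary)

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
error = misclassification_rate(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

In the second scenario of the abstract, the structured part of the feature vector would hold only age and sex, as it does in this toy example.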

    Automatic ICD-10 classification of diseases from Dutch discharge letters

    The international classification of diseases (ICD) is a widely used tool to describe patient diagnoses. At University Medical Center Utrecht (UMCU), for example, trained medical coders translate information from hospital discharge letters into ICD-10 codes for research and national disease epidemiology statistics, at considerable cost. To mitigate these costs, automatic ICD coding from discharge letters would be useful. However, this task has proven challenging in practice: it is a multi-label task with a large number of very sparse categories, presented in a hierarchical structure. Moreover, existing ICD systems have been benchmarked only on relatively easier versions of this task, such as single-label performance and performance on the higher “chapter” level of the ICD hierarchy, which contains fewer categories. In this study, we benchmark the state-of-the-art ICD classification systems and two baseline systems on a large dataset constructed from Dutch cardiology discharge letters at UMCU hospital. Performance of all systems is evaluated for both the easier chapter-level ICD codes and single-label version of the task found in the literature, as well as for the lower-level ICD hierarchy and multi-label task that is needed in practice. We find that state-of-the-art methods outperform the baseline for the single-label version of the task only. For the multi-label task, the baselines are not defeated by any state-of-the-art system, with the exception of HA-GRU, which does perform best in the most difficult task on accuracy. We conclude that practical performance may have been somewhat overstated in the literature, although deep learning techniques are sufficiently good to complement, though not replace, human ICD coding in our application
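The gap between the easier and harder versions of the task can be made concrete with a small scoring sketch. It evaluates the same multi-label predictions at full code level and at a coarser level. The coarsening rule (keeping only the leading letter) is an assumed stand-in for chapter-level coding, since some letters span more than one ICD-10 chapter. The code sets and function names are illustrative and do not come from the study.

```python
def coarsen(code):
    """Assumed stand-in for chapter-level coding: keep only the leading letter."""
    return code[0]


def exact_match_accuracy(true_sets, pred_sets):
    """Share of letters whose predicted code set equals the true set exactly."""
    hits = sum(t == p for t, p in zip(true_sets, pred_sets))
    return hits / len(true_sets)


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over all (letter, code) pairs."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative code sets for three discharge letters (not from the study).
true_codes = [{"I21.4", "E11.9"}, {"I50.1"}, {"I48.0", "J18.9"}]
pred_codes = [{"I21.9", "E11.9"}, {"I50.1"}, {"I48.0"}]

# The same predictions, scored at the coarser level.
true_coarse = [{coarsen(c) for c in codes} for codes in true_codes]
pred_coarse = [{coarsen(c) for c in codes} for codes in pred_codes]

full_accuracy = exact_match_accuracy(true_codes, pred_codes)
coarse_accuracy = exact_match_accuracy(true_coarse, pred_coarse)
full_f1 = micro_f1(true_codes, pred_codes)
coarse_f1 = micro_f1(true_coarse, pred_coarse)
```

On this toy input both scores rise when the labels are coarsened, which is the kind of effect that can make reported performance look better than what the full multi-label task requires.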

    Creating HIV risk profiles for men in South Africa: A latent class approach using cross-sectional survey data

    Introduction: Engaging at‐risk men in HIV prevention programs and services is a current priority, yet there are few effective ways to identify which men are at highest risk or how to best reach them. In this study we generated multi‐factor profiles of HIV acquisition/transmission risk for men in Durban, South Africa, to help inform targeted programming and service delivery. Methods: Data come from surveys with 947 men ages 20 to 40 conducted in two informal settlements from May to September 2017. Using latent class analysis (LCA), which detects a small set of underlying groups based on multiple dimensions, we identified classes based on nine HIV risk factors and socio‐demographic characteristics. We then compared HIV service use between the classes. Results: We identified four latent classes, with good model fit statistics. The older high‐risk class (20% of the sample; mean age 36) were more likely married/cohabiting and employed, with multiple sexual partners, substantial age‐disparity with partners (eight years younger on‐average), transactional relationships (including more resource‐intensive forms like paying for partner’s rent), and hazardous drinking. The younger high‐risk class (24%; mean age 27) were likely unmarried and employed, with the highest probability of multiple partners in the last year (including 42% with 5+ partners), transactional relationships (less resource‐intensive, e.g., clothes/transportation), hazardous drinking, and inequitable gender views. The younger moderate‐risk class (36%; mean age 23) were most likely unmarried, unemployed technical college/university students/graduates. They had a relatively high probability of multiple partners and transactional relationships (less resource‐intensive), and moderate hazardous drinking. Finally, the older low‐risk class (20%; mean age 29) were more likely married/cohabiting, employed, and highly gender‐equitable, with few partners and limited transactional relationships. 
Circumcision (status) was higher among the younger moderate‐risk class than either high‐risk class (p < 0.001). HIV testing and treatment literacy score were suboptimal and did not differ across classes. Conclusions: Distinct HIV risk profiles among men were identified. Interventions should focus on reaching the highest‐risk profiles who, despite their elevated risk, were less or no more likely than the lower‐risk to use HIV services. By enabling a more synergistic understanding of subgroups, LCA has potential to enable more strategic, data‐driven programming and evaluation
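To show how a latent class model assigns a respondent to a class, the sketch below computes posterior class probabilities from binary risk indicators using Bayes' rule. This is only the assignment step, not the fitting procedure, and it assumes items are independent within a class. The two classes, their prevalences, and the item probabilities are invented and are unrelated to the four classes reported in the abstract.

```python
def class_posteriors(responses, prevalences, item_probs):
    """Posterior probability of each latent class given binary responses.

    responses:   list of 0/1 answers, one per indicator
    prevalences: prior probability of each class
    item_probs:  item_probs[k][j] = P(indicator j == 1 | class k)
    """
    joint = []
    for prior, probs in zip(prevalences, item_probs):
        likelihood = prior
        for x, p in zip(responses, probs):
            # Items are assumed conditionally independent within a class.
            likelihood *= p if x == 1 else (1 - p)
        joint.append(likelihood)
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-class model with three binary indicators.
prevalences = [0.4, 0.6]
item_probs = [
    [0.9, 0.8, 0.7],  # hypothetical higher-risk class
    [0.2, 0.1, 0.3],  # hypothetical lower-risk class
]

posterior = class_posteriors([1, 1, 0], prevalences, item_probs)
assigned_class = posterior.index(max(posterior))
```

A respondent is typically assigned to the class with the highest posterior probability, which here is the first class.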

    Nation Binding: How Public Service Broadcasting Mitigates Political Selective Exposure

    Recent research suggests that more and more citizens select news and information that is congruent with their existing political preferences. This increase in political selective exposure (PSE) has allegedly led to an increase in polarization. The vast majority of studies stem from the US case with a particular media and political system. We contend that there are good reasons to believe PSE is less prevalent in other systems. We test this using latent profile analysis with national survey data from the Netherlands (n = 2,833). We identify four types of media use profiles and indeed only find partial evidence of PSE. In particular, we find that public broadcasting news cross-cuts all cleavages. This research note offers an important antidote in what is considered a universal phenomenon. We do find, however, a relatively large segment of citizens opting out of news consumption despite the readily available news in today’s media landscape