5 research outputs found
AliClu - Temporal sequence alignment for clustering longitudinal clinical data
The authors acknowledge funding from the Portuguese Foundation for Science and Technology (Fundação para a Ciência e a Tecnologia - FCT) under contracts INESC-ID (UID/CEC/50021/2019) and IT (UID/EEA/50008/2019), projects PREDICT (PTDC/CCI-CIF/29877/2017), PERSEIDS (PTDC/EMS-SIS/0642/2014) and NEUROCLINOMICS2 (PTDC/EEI-SII/1937/2014). The funders had no role in the design of the study, collection, analysis and interpretation of data, or writing the manuscript. BACKGROUND: Patient stratification is a critical task in clinical decision making since it can allow physicians to choose treatments in a personalized way. Given the increasing availability of electronic medical records (EMRs) with longitudinal data, one crucial problem is how to efficiently cluster the patients based on the temporal information from medical appointments. In this work, we propose applying the Temporal Needleman-Wunsch (TNW) algorithm to align discrete sequences with the transition time information between symbols. These symbols may correspond to a patient's current therapy, their overall health status, or any other discrete state. The transition time information represents the duration of each of those states. The obtained TNW pairwise scores are then used to perform hierarchical clustering. To find the best number of clusters and assess their stability, a resampling technique is applied. RESULTS: We propose AliClu, a novel tool for clustering temporal clinical data based on the TNW algorithm coupled with clustering validity assessments through bootstrapping. AliClu was applied to the analysis of the rheumatoid arthritis EMRs obtained from the Portuguese database of rheumatologic patient visits (Reuma.pt). In particular, AliClu was used for the analysis of therapy switches, which were coded as letters corresponding to biologic drugs and included their durations before each change occurred.
The obtained optimized clusters allow one to stratify the patients based on their temporal therapy profiles and to support the identification of common features for those groups. CONCLUSIONS: The AliClu is a promising computational strategy to analyse longitudinal patient data by providing validated clusters and by unravelling the patterns that exist in clinical outcomes. Patient stratification is performed in an automatic or semi-automatic way, allowing one to tune the alignment, clustering, and validation parameters. The AliClu is freely available at https://github.com/sysbiomed/AliClu.
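The clustering step described in this abstract (pairwise alignment scores fed into hierarchical clustering) can be illustrated with a minimal sketch. The function name `agglomerate`, the choice of average linkage, and the score matrix below are assumptions made for illustration only; they are not taken from the AliClu code, and the bootstrap validation step is not shown.

```python
from itertools import combinations


def agglomerate(scores, n_clusters):
    """Average-linkage agglomerative clustering on a symmetric similarity matrix.

    scores[i][j] is the alignment score between sequences i and j
    (higher means more similar). Returns a sorted list of clusters,
    each a sorted list of sequence indices.
    """
    clusters = [[i] for i in range(len(scores))]

    def linkage(a, b):
        # Average pairwise similarity between the members of two clusters.
        return sum(scores[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


# Hypothetical pairwise scores for four patients (illustrative, not real data).
example_scores = [
    [10, 8, 1, 0],
    [8, 10, 2, 1],
    [1, 2, 10, 7],
    [0, 1, 7, 10],
]
print(agglomerate(example_scores, 2))  # [[0, 1], [2, 3]]
```

In practice the number of clusters would not be fixed by hand; the abstract states that it is chosen through a resampling procedure.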
Prediction Sequence Patterns of Tourist from the Tourism Website by Hybrid Deep Learning Techniques
Tourism is an important industry that generates income and jobs and contributes considerably to a country's GDP. Before traveling, tourists usually need to plan an itinerary listing a sequence of where to visit and what to do. To help plan, tourists usually gather information by reading blogs and boards where previous visitors have posted about places and activities. Travel itineraries, and the sequences of places to visit and activities to experience, can be inferred from the text of these posts. This research aims to analyze text postings using 21 deep learning techniques to learn sequential patterns of places and activities. The three main techniques are Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), together with combinations of these techniques, including their adaptation with batch normalization. The output is sequential patterns for predicting the places that tourists are likely to visit and the activities they plan to do. The results are evaluated using mean absolute error (MAE) and mean squared error (MSE) loss metrics. Moreover, the predicted sequences of places and activities are further assessed using a sequence alignment method called the Needleman–Wunsch algorithm (NW), which is a popular method to estimate the degree of matching between two sequences.
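To make the Needleman–Wunsch assessment concrete, the following is a minimal sketch of the standard global alignment score applied to two sequences of place labels. The scoring values (match 1, mismatch -1, gap -1) and the example itineraries are assumptions chosen for illustration; the abstract does not state which parameters the study used.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diagonal, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]


# Hypothetical itineraries: the prediction skips one place.
actual = ["temple", "market", "beach", "museum"]
predicted = ["temple", "beach", "museum"]
print(needleman_wunsch_score(actual, predicted))  # 2
```

Three matches and one gap give a score of 2 here, against a maximum of 4 for a perfect prediction of the actual itinerary.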
Unsupervised learning methods for identifying and evaluating disease clusters in electronic health records
Introduction
Clustering algorithms are a class of algorithms that can discover groups of observations in
complex data and are often used to identify subtypes of heterogeneous diseases in electronic
health records (EHR). Evaluating clustering experiments for biological and clinical significance is
a vital but challenging task due to the lack of consensus on best practices. As a result, the
translation of findings from clustering experiments to clinical practice is limited.
Aim
The aim of this thesis was to investigate and evaluate approaches that enable the evaluation of
clustering experiments using EHR.
Methods
We conducted a scoping review of clustering studies in EHR to identify common evaluation
approaches. We systematically investigated the performance of the identified approaches using
a cohort of Alzheimer's Disease (AD) patients as an exemplar comparing four different
clustering methods (K-means, Kernel K-means, Affinity Propagation and Latent Class
Analysis). Using the same population, we developed and evaluated a method (MCHAMMER)
that tested whether clusterable structures exist in EHR. To develop this method we tested
several cluster validation indexes and methods of generating null data to see which are the best
at discovering clusters. In order to enable the robust benchmarking of evaluation approaches,
we created a tool that generated synthetic EHR data that contain known cluster labels across a
range of clustering scenarios.
Results
Across 67 EHR clustering studies, the most popular internal evaluation metric was comparing
cluster results across multiple algorithms (30% of studies). We examined this approach by
conducting a clustering experiment on a population of 10,065 AD patients with
21 demographic, symptom and comorbidity features. K-means found 5 clusters, Kernel K-means found 2, Affinity Propagation found 5 and Latent Class Analysis found 6. K-means
was found to have the best clustering solution with the highest silhouette score (0.19) and was
more predictive of outcomes. The five clusters found were: typical AD (n=2026), non-typical AD
(n=1640), cardiovascular disease cluster (n=686), a cancer cluster (n=1710) and a cluster of
mental health issues, smoking and early disease onset (n=1528), which has been found in
previous research as well as in the results of other clustering methods. We created a synthetic
data generation tool which allows for the generation of realistic EHR clusters that can vary in
separation and number of noise variables to alter the difficulty of the clustering problem. We
found that decreasing cluster separation significantly increased clustering difficulty, whereas
adding noise variables increased difficulty but not significantly. To develop the tool to assess
cluster existence, we tested different methods of null dataset generation and cluster validation
indices. The best performing null dataset method was the min-max method, and the best
performing indices were the Calinski Harabasz index (accuracy of 94%), the Davies Bouldin
index (97%), the silhouette score (93%) and the BWC index (90%). We further found that when clusters
were identified using the Calinski Harabasz index they were more likely to have significantly
different outcomes between clusters. Lastly, we repeated the initial clustering experiment,
comparing 10 different pre-processing methods. The three best performing methods were RBF
kernel (2 clusters), MCA (4 clusters) and MCA and PCA (6 clusters). The MCA approach gave
the best results, with the highest silhouette score (0.23) and meaningful clusters. Its 4 clusters were:
heart and circulatory (n=1379), early onset mental health (n=1761), a male cluster with memory
loss (n=1823) and a female cluster with more problems (n=2244).
Conclusion
We have developed and tested a series of methods and tools to enable the evaluation of EHR
clustering experiments. We developed and proposed a novel cluster evaluation metric and
provided a tool for benchmarking evaluation approaches in synthetic but realistic EHR
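
The silhouette scores reported in this abstract (0.19 and 0.23) come from a standard internal validation index. As a rough sketch of how that index is computed, the code below implements the mean silhouette coefficient with Euclidean distance in plain Python. The function name, the toy points and the labels are hypothetical; this is not the thesis's implementation, which worked on EHR features rather than 2-D coordinates.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_distance(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: the coefficient is defined as 0
        a = mean_distance(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two well-separated groups (illustrative values only).
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(toy_points, [0, 0, 1, 1])
bad = silhouette_score(toy_points, [0, 1, 0, 1])
print(round(good, 2), round(bad, 2))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.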
Analysing the Impact of Changes in User Interface of e-Health Record Systems on Clinical Pathways using Process Mining
The provision of care in a hospital includes a series of activities that are often recorded in electronic health record (EHR) systems. Analysing the data in these EHRs has the potential to support the understanding of care processes and the exploration of opportunities for process improvement.
One of the emerging data analytics approaches for such analyses is process mining, and one critical challenge in working with EHR data is that processes might change over time. This thesis uses a process mining approach to detect process change over time and analyse the impact of those changes on the EHR data. The overall aim is to summarise the change in the data that is attributable to the process so that clinicians can better analyse the data.
Three datasets were used in this study to understand the variability of the EHR systems. The first dataset is a publicly available EHR dataset that was used for developing the methods and supporting the reproducibility of the research. The second dataset is a de-identified subset of the database of cancer patients from the Leeds Cancer Centre. It was used in experiments to improve on the results of a previous study using the same dataset. The third dataset was the full Leeds Cancer Centre EHR database, used after more comprehensive ethics approval was granted. In the third dataset, experiments were done to analyse the impact of a known system change on clinical pathways and to explore process change over time without a known system change. All three datasets were analysed using process mining.
Process mining was shown to be useful for analysing clinical pathways and exploring process changes over time. It can be used to visualise the process before and after a known change. When the system change is unknown, process mining can be used to explore the process execution over time and identify the potential period where the system was changed. This thesis explores some aspects of the complex interrelatedness of process and user interface (UI) of the EHR system
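One simple way to picture the before-and-after comparison described here is to count directly-follows relations (activity B occurring immediately after activity A within a case) on each side of a known change point. The sketch below is an illustration under stated assumptions: the event tuple layout, the function names, the activity labels and the change time are all hypothetical, and the thesis may use different process mining techniques.

```python
from collections import Counter, defaultdict


def directly_follows(events):
    """Count activity pairs (a, b) where b directly follows a within a case.

    events is an iterable of (case_id, timestamp, activity) tuples.
    """
    traces = defaultdict(list)
    for case_id, timestamp, activity in events:
        traces[case_id].append((timestamp, activity))
    counts = Counter()
    for steps in traces.values():
        ordered = [activity for _, activity in sorted(steps)]
        counts.update(zip(ordered, ordered[1:]))
    return counts


def compare_periods(events, change_time):
    """Return (gained, lost) directly-follows relations around change_time.

    Events are split by timestamp, so a case that spans the change point
    contributes to both periods.
    """
    before = directly_follows(e for e in events if e[1] < change_time)
    after = directly_follows(e for e in events if e[1] >= change_time)
    return set(after) - set(before), set(before) - set(after)


# Hypothetical event log with integer timestamps (not real patient data).
log = [
    ("c1", 1, "referral"), ("c1", 2, "consultation"), ("c1", 3, "treatment"),
    ("c2", 11, "referral"), ("c2", 12, "diagnostics"), ("c2", 13, "treatment"),
]
gained, lost = compare_periods(log, change_time=10)
print(sorted(gained))
print(sorted(lost))
```

When the change point is unknown, the same counts could be computed over sliding time windows to look for the period in which the relations shift.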