20 research outputs found

    Unsupervised machine learning of integrated health and social care data from the Macmillan Improving the Cancer Journey service in Glasgow

    Background: Improving the Cancer Journey (ICJ) was launched in 2014 by Glasgow City Council and Macmillan Cancer Support. As part of routine service, data are collected on ICJ users, including demographic and health information, results from holistic needs assessments, and quality-of-life scores as measured by EQ-5D health status. There are also data on the number and type of referrals made and feedback from users on the overall service. By applying artificial intelligence and interactive visualization technologies to these data, we seek to improve service provision and optimize resource allocation. Method: An unsupervised machine-learning algorithm was deployed to cluster the data. The classical k-means algorithm was extended with the k-modes technique for categorical data, and the gap heuristic automatically identified the number of clusters. The resulting clusters are used to summarize complex data sets and produce three-dimensional visualizations of the data landscape. Furthermore, the traits of new ICJ clients are predicted by approximately matching their details to the nearest existing cluster center. Results: Cross-validation showed the model’s effectiveness over a wide range of traits. For example, the model can predict marital status, employment status and housing type with an accuracy between 2.4 and 4.8 times greater than random selection. One of the most interesting preliminary findings is that area deprivation (measured through the Scottish Index of Multiple Deprivation, SIMD) is a better predictor of an ICJ client’s needs than primary diagnosis (cancer type). Conclusion: A key strength of this system is its ability to rapidly ingest new data on its own and derive new predictions from those data. This means the model can guide service provision by forecasting demand based on actual or hypothesized data. The aim is to provide intelligent person-centered recommendations. The machine-learning model described here is part of a prototype software tool currently under development for use by the cancer support community. Disclosure: Funded by Macmillan Cancer Support.
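The clustering and nearest-center matching described in this abstract can be illustrated with a minimal sketch. It covers only the categorical (k-modes) part, using the simple matching dissimilarity, and is not the authors' implementation: the function names, the `init` parameter, and the example attributes are assumptions made for illustration, and the gap heuristic and the numerical k-means extension are left out.

```python
import random
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent category in each column of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(records, k, n_iter=20, seed=0, init=None):
    """Plain k-modes on a list of equal-length tuples of categories.

    `init` (hypothetical parameter) lets the caller fix the starting modes;
    otherwise k records are drawn at random.
    """
    rng = random.Random(seed)
    modes = list(init) if init is not None else rng.sample(records, k)
    for _ in range(n_iter):
        # Assignment step: each record joins its nearest mode.
        clusters = [[] for _ in range(k)]
        for r in records:
            nearest = min(range(k), key=lambda j: hamming(r, modes[j]))
            clusters[nearest].append(r)
        # Update step: recompute each mode; an empty cluster keeps its old mode.
        new_modes = [column_modes(c) if c else modes[j]
                     for j, c in enumerate(clusters)]
        if new_modes == modes:
            break
        modes = new_modes
    return modes

def predict_trait(partial, modes, known_idx, target_idx):
    """Match on the known attributes only, then read the target attribute
    off the nearest cluster mode."""
    nearest = min(modes, key=lambda m: sum(partial[i] != m[i] for i in known_idx))
    return nearest[target_idx]
```

For example, with records of the form `(marital_status, employment_status, housing_type)`, calling `predict_trait(("single", None, "renter"), modes, [0, 2], 1)` returns the employment status stored in the mode that best matches the two known attributes.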

    An Enhanced Initialization Method to Find an Initial Center for K-modes Clustering

    Data mining is a technique for extracting information from large amounts of data. Clustering is used to group objects that have similar characteristics. The K-means clustering algorithm is very efficient for large data sets with numerical attributes; however, it does not work well for real-world data sets in which most attributes take categorical values. In such cases the K-modes algorithm is used in place of K-means. The existing system considers the initialization of K-modes clustering from the viewpoint of outlier detection, which prevents several initial cluster centers from coming from the same cluster. To overcome the above-mentioned limitation, it uses the Initial_Distance and Initial_Entropy algorithms, which apply a new weighting formula to calculate the degree of outlierness of each object. The K-modes algorithm can thereby guarantee that the chosen initial cluster centers are not outliers. To improve performance further, a new modified distance metric, the weighted matching distance, is used to calculate the distance between two objects during initialization. In addition, a data pre-processing method is used to improve the quality of the data. Experiments were carried out on several data sets from the UCI repository, and the results demonstrated the effectiveness of the initialization method in the proposed algorithm.

    Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering

    The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can become trapped in local optima and are sensitive to the initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless of the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for k-modes clustering, which is independent of the initial solution and can directly obtain the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results on medium and large-sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site, were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The proposed methods outperformed the conventional and other existing enhanced k-modes algorithms in the literature, and produced new and improved results for 9 of the UCI benchmark datasets.
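The abstract does not state the ILP itself. As background only, the standard k-modes objective that such a formulation would optimize can be written as follows, for $n$ objects $x_i$ with $m$ categorical attributes, $k$ modes $q_l$, and binary assignment variables $w_{il}$. The symbols are the usual textbook notation, assumed here rather than taken from the paper.

```latex
\begin{aligned}
\min_{W,\,Q}\quad & P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{il}\, d(x_i, q_l),
\qquad
d(x_i, q_l) = \sum_{j=1}^{m} \delta(x_{ij}, q_{lj}), \\
& \delta(a, b) =
\begin{cases}
0 & \text{if } a = b, \\
1 & \text{otherwise},
\end{cases} \\
\text{subject to}\quad & \sum_{l=1}^{k} w_{il} = 1 \quad (i = 1, \dots, n),
\qquad w_{il} \in \{0, 1\}.
\end{aligned}
```

The product of the assignment variable $w_{il}$ with a distance that itself depends on the unknown mode $q_l$ is what makes the problem non-linear as written; an ILP approach has to linearize that interaction in some way, and how the paper does so is not given in this abstract.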

    A fair-multicluster approach to clustering of categorical data

    In the last few years, the need to prevent classification biases due to race, gender, social status, etc. has increased interest in designing fair clustering algorithms. The main idea is to ensure that the output of a clustering algorithm is not biased towards or against specific subgroups of the population. There is a growing specialized literature on this topic, dealing with the problem of clustering numerical databases. Nevertheless, to our knowledge, there are no previous papers devoted to the problem of fair clustering of purely categorical attributes. In this paper, we show that the Multicluster methodology proposed by Santos and Heras (Interdiscip J Inf Knowl Manag 15:227–246, 2020. https://doi.org/10.28945/4643) for clustering categorical data can be modified in order to increase the fairness of the clusters. Of course, there is a tradeoff between fairness and efficiency, so that an increase in the fairness objective usually leads to a loss of classification efficiency. Yet it is possible to reach a reasonable compromise between these goals, since the methodology proposed by Santos and Heras (2020) can be easily adapted in order to get homogeneous and fair clusters.

    Congruence between latent class and k-modes analyses in the identification of oncology patients with distinct symptom experiences

    CONTEXT: Risk profiling of oncology patients based on their symptom experience assists clinicians in providing more personalized symptom management interventions. Recent findings suggest that oncology patients with distinct symptom profiles can be identified using a variety of analytic methods. OBJECTIVES: The objective of this study was to evaluate the concordance between the number and types of subgroups of patients with distinct symptom profiles using latent class analysis and K-modes analysis. METHODS: Using data on the occurrence of 25 symptoms from the Memorial Symptom Assessment Scale, which 1329 patients completed prior to their next dose of chemotherapy (CTX), Cohen's kappa coefficient was used to evaluate concordance between the two analytic methods. For both latent class analysis and K-modes, differences among the subgroups in demographic, clinical, and symptom characteristics, as well as quality of life (QOL) outcomes, were determined using parametric and nonparametric statistics. RESULTS: Using both analytic methods, four subgroups of patients with distinct symptom profiles were identified (i.e., all low, moderate physical and lower psychological, moderate physical and higher psychological, and all high). The percent agreement between the two methods was 75.32%, which suggests a moderate level of agreement. In both analyses, patients in the all high group were significantly younger and had a higher comorbidity profile, worse Memorial Symptom Assessment Scale subscale scores, and poorer QOL outcomes. CONCLUSION: Both analytic methods can be used to identify subgroups of oncology patients with distinct symptom profiles. Additional research is needed to determine which analytic methods and which dimension of the symptom experience provide the most sensitive and specific risk profile.
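The two concordance measures named in this abstract, percent agreement and Cohen's kappa, can be computed as in the sketch below. It assumes each patient has one subgroup label from each method and that the labels of the two methods have already been aligned (for example, both use "all low" and "all high"); the function names are illustrative and the study's own analysis code is not shown in the abstract.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items that both methods place in the same subgroup."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both methods pick the same label
    # if each assigned labels independently at its own observed rates.
    p_e = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)
```

Kappa is lower than raw percent agreement whenever some of the agreement could arise by chance, which is why the two figures are usually read together.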