    Multiple Imputation based Clustering Validation (MIV) for Big Longitudinal Trial Data with Missing Values in eHealth

    Web-delivered trials are an important component of eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Building upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we proposed a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values and, more generally, for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
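
    As a rough illustration of the fuzzy-clustering index named in this abstract, the sketch below computes the generalized Xie and Beni index (membership-weighted compactness divided by n times the smallest squared distance between cluster centers; lower values suggest a better partition). The function names, the data layout, and in particular the plain averaging of the index over imputed datasets are assumptions made for illustration; the abstract does not state how MIV synthesizes indices across imputations.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(points: List[Point], centers: List[Point],
             memberships: List[List[float]], m: float = 2.0) -> float:
    """Generalized Xie-Beni index.

    memberships[i][k] is the membership of point k in cluster i,
    and m is the fuzzifier (m = 2 gives the original definition).
    """
    if len(centers) < 2:
        raise ValueError("at least two cluster centers are required")
    n = len(points)
    # Numerator: membership-weighted within-cluster scatter.
    compactness = sum(
        (memberships[i][k] ** m) * sq_dist(points[k], centers[i])
        for i in range(len(centers))
        for k in range(n)
    )
    # Denominator: smallest squared distance between any two centers.
    separation = min(
        sq_dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return compactness / (n * separation)


def pooled_xie_beni(
    imputed_results: List[Tuple[List[Point], List[Point], List[List[float]]]],
    m: float = 2.0,
) -> float:
    """Average the index over imputed datasets (an assumed pooling rule)."""
    values = [xie_beni(p, c, u, m) for p, c, u in imputed_results]
    return sum(values) / len(values)
```

    In use, one would compute the pooled value for each candidate number of clusters and prefer the candidate with the smallest value.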

    A New MI-Based Visualization Aided Validation Index for Mining Big Longitudinal Web Trial Data

    Web-delivered clinical trials generate big complex data. To help untangle the heterogeneity of treatment effects, unsupervised learning methods have been widely applied. However, identifying valid patterns is a high-priority but challenging issue for these methods. This paper, built upon our previous research on multiple imputation (MI)-based fuzzy clustering and validation, proposes a new MI-based Visualization-aided validation index (MIVOOS) to determine the optimal number of clusters for big incomplete longitudinal Web-trial data with inflated zeros. Different from a recently developed fuzzy clustering validation index, MIVOOS uses more suitable overlap and separation measures for Web-trial data and, unlike the widely used Xie and Beni (XB) index, does not depend on the choice of fuzzifiers. Through optimizing the view angles of 3-D projections using Sammon mapping, the optimal 2-D projection-guided MIVOOS is obtained to better visualize and verify the patterns in conjunction with trajectory patterns. Compared with XB and VOS, our newly proposed MIVOOS shows its robustness in validating big Web-trial data under different missing data mechanisms using real and simulated Web-trial data.

    Systematic Review on Missing Data Imputation Techniques with Machine Learning Algorithms for Healthcare

    Missing data is one of the most common issues encountered in the data cleaning process, especially when dealing with medical datasets. A real collected dataset is prone to be incomplete, inconsistent, noisy and redundant due to potential reasons such as human errors, instrumental failures, and adverse death. Therefore, to accurately deal with incomplete data, a sophisticated algorithm is needed to impute those missing values. Many machine learning algorithms have been applied to impute missing data with plausible values. Among all machine learning imputation algorithms, however, the KNN algorithm has been widely adopted for imputing missing data due to its robustness and simplicity, and it is a promising method that can outperform other machine learning methods. This paper provides a comprehensive review of different imputation techniques used to replace missing data. The goal of the review paper is to bring specific attention to potential improvements to existing methods and to provide readers with a better grasp of imputation technique trends.
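
    To make the KNN imputation idea in this abstract concrete, here is a minimal standard-library sketch, not any particular package's implementation. It assumes missing entries are encoded as None, measures distance only over the features that two rows both observe, and fills each gap with the unweighted mean of that feature over the k nearest rows that have it; the function name knn_impute and these design choices are illustrative assumptions.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]


def knn_impute(rows: List[Row], k: int = 2) -> List[Row]:
    """Return a copy of rows with None entries filled by a k-nearest-neighbour mean."""

    def distance(a: Row, b: Row) -> float:
        # Root mean squared difference over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common features, so the rows are not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observe feature j.
            donors = [
                (distance(row, other), other[j])
                for h, other in enumerate(rows)
                if h != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap as None if no usable donor exists
                completed[i][j] = sum(nearest) / len(nearest)
    return completed
```

    Distances are always computed on the original rows, so previously imputed values never act as donors in this sketch.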

    Vol. 15, No. 1 (Full Issue)

    An overview of clustering methods with guidelines for application in mental health research

    Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements. In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages and implementation of major algorithms that are particularly relevant in mental health research. Extensions of basic models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles are subsequently introduced. How to choose algorithms to address common issues as well as methods for pre-clustering data processing, clustering evaluation and validation are then discussed. Importantly, we also provide general guidance on clustering workflow and reporting requirements. To facilitate the implementation of different algorithms, we provide information on R functions and libraries.

    Information technologies for pain management

    Millions of people around the world suffer from pain, acute or chronic, which raises the importance of its screening, assessment and treatment. The importance of pain is attested by the fact that it is considered the fifth vital sign for indicating basic bodily functions, health and quality of life, together with the four other vital signs: blood pressure, body temperature, pulse rate and respiratory rate. However, while these four signals represent an objective physical parameter, the occurrence of pain expresses an emotional status that happens inside the mind of each individual and is therefore highly subjective, which makes its management and evaluation difficult. For this reason, the self-report of pain is considered the most accurate pain assessment method, wherein patients should be asked to periodically rate their pain severity and related symptoms. Thus, in recent years computerised systems based on mobile and web technologies have been increasingly used to enable patients to report their pain, which has led to the development of electronic pain diaries (ED). This approach may give health care professionals (HCP) and patients the ability to interact with the system anywhere and at any time, which thoroughly changes the coordinates of time and place and offers invaluable opportunities for healthcare delivery. However, most of these systems were designed to interact directly with patients without the presence of a healthcare professional or without evidence of reliability and accuracy. In fact, the observation of the existing systems revealed a lack of integration with mobile devices, limited use of web-based interfaces and reduced interaction with patients in terms of obtaining and viewing information. In addition, the reliability and accuracy of computerised systems for pain management are rarely proved, and their effects on HCP and patient outcomes remain understudied.
This thesis is focused on technology for pain management and aims to propose a monitoring system which includes ubiquitous interfaces specifically oriented to either patients or HCP using mobile devices and the Internet, so as to allow decisions based on the knowledge obtained from the analysis of the collected data. With interoperability and cloud computing technologies in mind, this system uses web services (WS) to manage data, which are stored in a Personal Health Record (PHR). A Randomised Controlled Trial (RCT) was implemented to determine the effectiveness of the proposed computerised monitoring system. The six-week RCT evidenced the advantages that ubiquitous access provided to HCP and patients, who were able to interact with the system anywhere and at any time using WS to send and receive data. In addition, the collected data were stored in a PHR, which offers integrity and security as well as permanent online accessibility to both patients and HCP. The study evidenced not only that the majority of participants recommend the system, but also that they recognize its suitability for pain management without the requirement of advanced skills or experienced users. Furthermore, the system enabled the definition and management of patient-oriented treatments with reduced therapist time. The study also revealed that the guidance of HCP at the beginning of the monitoring is crucial to patients' satisfaction and experience stemming from the usage of the system, as evidenced by the high correlation between the recommendation of the application and its suitability to improve pain management and to provide medical information. There were no significant differences in improvements in the quality of pain treatment between the intervention group and the control group. Based on the data collected during the RCT, a clinical decision support system (CDSS) was developed to offer capabilities of tailored alarms, reports, and clinical guidance.
This CDSS, called Patient Oriented Method of Pain Evaluation System (POMPES), is based on the combination of several statistical models (one-way ANOVA, Kruskal-Wallis and Tukey-Kramer) with an imputation model based on linear regression. This system achieved full accuracy in its suggested decisions when compared with the medical diagnosis, and therefore revealed its suitability for managing pain. Lastly, based on the capability of aerospace systems to deal with different complex data sources with varied complexities and accuracies, an innovative model was proposed. This model is characterized by a qualitative analysis stemming from the data fusion method combined with a quantitative model based on the comparison of the standard deviation together with the values of mathematical expectations. This model aimed to compare the effects of technological and pen-and-paper systems when applied to different dimensions of pain, such as pain intensity, anxiety, catastrophizing, depression, disability and interference. It was observed that pen-and-paper and technology produced equivalent effects in anxiety, depression, interference and pain intensity. In contrast, technology evidenced favourable effects in terms of catastrophizing and disability. The proposed method proved to be suitable, intelligible, easy to implement, and low in time and resource consumption. Further work is needed to evaluate the proposed system while following up participants for longer periods of time, including a complementary RCT encompassing patients with chronic pain symptoms. Finally, additional studies should determine the economic effects not only for patients but also for the healthcare system.

    Phenotyping Risk Profiles of Substance Use and Exploring the Dynamic Transitions in Use Patterns: Machine Learning Models using the COMPASS Data

    Background Polysubstance use is on the rise among Canadian youth. Examining risk profiles and understanding how the transition occurs in use patterns can inform the design and implementation of polysubstance risk reduction intervention. The COMPASS study is longitudinal research examining health-related behaviours among Canadian secondary school students, capturing data from multiple sources. Machine learning (ML) techniques can reveal non-linearity and multivariate couplings associated with population-level longitudinal data to inform public health policies. Objectives The overarching goal of this thesis is to identify phenotypes of risk profiles of youth polysubstance use and examine the dynamic transitions of use patterns across time, utilizing both unsupervised ML methods and a latent variable modelling approach. This thesis also aims to understand how ML techniques are best used in modelling transitions and discovering the “hidden” patterns from large complex population-based health survey data, using the COMPASS dataset as a showcase. Methods A linked sample (N = 8824) of three annual waves of the COMPASS data collected starting from the school year of 2016-17 was used. Multiple imputations for missing values were performed. Substance use indicators, including cigarette smoking, e-cigarette use, alcohol drinking, and marijuana consumption, were categorized into “never use,” “occasional use,” and “current use.” To examine phenotypes of risk profiles, hierarchical clustering, partitioning around medoids (PAM), and fuzzy clustering algorithms were applied. The Boruta algorithm was used to identify a subset of features for cluster analysis. Both the internal and external indices were employed to evaluate the clustering validity. A multivariate latent Markov model (LMM) was implemented to explore the dynamic transitions of use patterns over time. 
The least absolute shrinkage and selection operator (LASSO) approach was applied to select the appropriate covariates for entering the LMM. Model selection was based on the Bayesian information criterion (BIC) and the goodness-of-fit test. Results The top factors impacting youth polysubstance use included the number of smoking friends, the number of skipped classes, the weekly money to spend/save oneself, and others. Four risk profiles of polysubstance use were identified across the three waves: low, medium-low, medium-high, and high-risk profiles. The heterogeneity in the prevalence and phenotype across these four risk profiles was confirmed. The internal measures of clustering performance measured by average silhouette width ranged from 0.51 to 0.55 across the three waves using different clustering algorithms. The clustering algorithms achieved a relatively high degree of agreement on cluster membership. Comparing the fuzzy (FANNY) clustering with PAM clustering, the adjusted Rand indices were 0.9698, 0.7676, and 0.6452 for the three waves. Four distinct use patterns were identified: no use (S1), occasional single-use of alcohol (S2), dual-use of e-cigarette and alcohol (S3), and current multi-use (S4). The initial probabilities of each subgroup were 0.5887, 0.2156, 0.1487, and 0.0470. The marginal distribution of S1 decreased, while that of S3 and S4 increased over time, indicating a tendency towards increased substance use as the students grew older. Although most students generally remained in the same subgroup across time, particularly the individuals in S4 with the highest transition probability (0.8668), those who transitioned typically moved towards a more severe use pattern group over time, e.g., S3 -> S4. Factors that impact the initial membership of use patterns and the dynamic transitions were multifaceted and complex across the four use patterns across the three waves.
Not only do use patterns change with time, but so does the evidence in use patterns. Conclusion As the first study of its kind to ascertain risk profiles and dynamics of use patterns in youth polysubstance use, by employing ML approaches to the COMPASS dataset, this thesis provides insights into the opportunities and possibilities ahead for ML in Public Health. Findings from this thesis can be beneficial to practitioners in the field, such as school program managers or policymakers, in their capacity to develop interventions to prevent or remedy polysubstance use among youth

    OsamorSoft: clustering index for comparison and quality validation in high throughput dataset

    Differences in the results obtained from different k-means clustering algorithms call for a simplified approach to validating the quality of the clusters obtained. These differences arise partly because the algorithms differ in how they select their first seed or centroid (randomly, sequentially, or by some other principle), which tends to influence the final result. Popular external cluster quality validation and comparison models require the computation of various clustering indexes such as Rand, Jaccard, Fowlkes and Mallows, Morey and Agresti Adjusted Rand Index (ARIMA) and Hubert and Arabie Adjusted Rand Index (ARIHA). In the literature, the Hubert and Arabie Adjusted Rand Index (ARIHA) has been adjudged a good measure of cluster validity. Based on ARIHA as a popular clustering quality index, we developed OsamorSoft, which comprises DNA_Omatrix and OsamorSpreadSheet, as a tool for cluster quality validation in high throughput analysis. The proposed method will help to bridge the yawning gap created by the small number of user-friendly tools available to externally evaluate the ever-increasing number of clustering algorithms. Our implementation was tested with clusters created by four k-means algorithms using malaria microarray data. Furthermore, our results yielded compact 4-stage OsamorSpreadSheet statistics that OsamorSoft, our easy-to-use Java GUI and spreadsheet-based tool, uses for cluster quality comparison. It is recommended that a framework be developed to facilitate the simplified integration and automation of several other cluster validity indexes for comparative analysis of big data problems.
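
    For readers unfamiliar with the Hubert and Arabie Adjusted Rand Index mentioned in this abstract, the following sketch computes it from two label vectors using the standard contingency-table formula. It is a generic illustration written for this listing, not the OsamorSoft implementation; the function name and the handling of the degenerate case are assumptions.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_a: Sequence[Hashable],
                        labels_b: Sequence[Hashable]) -> float:
    """Hubert and Arabie adjusted Rand index between two partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are required")

    # Pairs of objects placed together in both partitions (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately (row/column totals).
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2        # largest attainable agreement
    if max_index == expected:
        # Degenerate case (for example, both partitions are a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

    The index equals 1 when the two partitions are identical up to relabeling, is close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.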

    Impact of Temporal Order Selection on Clustering Intensive Longitudinal Data Based on VAR Models

    In real-world research, intensive longitudinal data (ILDs) are typically collected from a group of individuals of interest, which enables researchers to model not only the within-individual dynamics of the studied processes but also the between-individual differences on the within-individual dynamics. Among the statistical techniques proposed for modeling ILDs of multiple individuals, clustering of intensive longitudinal data provides a meaningful way to quantify sample heterogeneity in dynamic processes, assuming that such heterogeneity reflects the distinct nature of the studied processes. The aims of this dissertation are threefold: (a) to introduce a VAR-based clustering technique, (b) to examine the impact of temporal order selection on clustering accuracy and parameter estimation by a simulation study, and (c) to demonstrate the application of the clustering technique through an empirical analysis. Specifically, I investigated the influence of two temporal order selection strategies, (1) using the most complex structure or highest order (HO) for all individual processes, and (2) using the most parsimonious structure or the lowest order (LO) for all individuals, on the performance of a two-step model-based clustering procedure. This procedure extracted dynamic coefficients from vector autoregressive (VAR) models and employed the Gaussian mixture model (GMM) and K-means clustering algorithms on the coefficients for cluster identification. I also examined whether the influence varied across the two clustering algorithms. The simulation study showed that, regardless of the clustering algorithm used, the LO strategy consistently outperformed the HO strategy in terms of recovering the number of clusters, cluster membership, and cluster-specific AR and CR effects. GMM performed better than K-means when the LO strategy was applied; however, the performance of GMM decreased as the temporal orders increased.
Additionally, GMM showed more vulnerability with smaller numbers of participants. The application of the two-step VAR-based method to affect data yielded a meaningful and informative clustering solution, which provided further insights into the use of the model-based clustering approach. Lastly, suggestions and recommendations were offered based on the results of the simulation and empirical analyses.
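
    The two-step procedure described in this abstract (fit a dynamic model per individual, then cluster the estimated coefficients) can be sketched in a heavily simplified form. The example below is an assumption-laden illustration, not the dissertation's method: it replaces the VAR model with a univariate AR(1) fitted by least squares without an intercept, uses a small 1-D K-means in place of GMM or multivariate K-means, and runs on simulated series with hypothetical coefficients 0.8 and -0.5.

```python
import random
from typing import List, Tuple


def ar1_coefficient(series: List[float]) -> float:
    """Step 1: least-squares slope of x[t] on x[t-1] (no intercept)."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den


def kmeans_1d(values: List[float], k: int = 2,
              iterations: int = 50) -> Tuple[List[int], List[float]]:
    """Step 2: Lloyd's algorithm on scalars, started from spread-out quantiles."""
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
    return labels, centers


def simulate_ar1(phi: float, length: int, rng: random.Random) -> List[float]:
    """Generate x[t] = phi * x[t-1] + standard normal noise, starting at 0."""
    x = [0.0]
    for _ in range(length - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


# Hypothetical sample: five individuals per group, 300 time points each.
rng = random.Random(0)
true_phis = [0.8] * 5 + [-0.5] * 5
coefficients = [ar1_coefficient(simulate_ar1(phi, 300, rng)) for phi in true_phis]
labels, centers = kmeans_1d(coefficients, k=2)
```

    With a full VAR model, each individual would contribute a vector of autoregressive and cross-regressive coefficients, and the clustering step would operate on those vectors instead of on a single number.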


    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting has been organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization, whose aims are to further classification research