
    Clustering heterogeneous categorical data using enhanced mini batch K-means with entropy distance measure

    Clustering methods in data mining aim to group a set of patterns based on their similarity. In survey data, heterogeneous information arises from various measurement scales, such as nominal, ordinal, binary, and Likert scales. Failing to treat this heterogeneity leads to loss of information and poor decision-making. Although many similarity measures have been established, solutions for clustering heterogeneous data are still lacking. The recent entropy distance measure appears to give good results for heterogeneous categorical data, but it still requires extensive experimentation and evaluation. This article proposes a framework for heterogeneous categorical data based on mini batch k-means with an entropy measure (MBKEM) and investigates the effectiveness of the similarity measure in clustering such data. Secondary data from a public survey were used. The findings demonstrate that the proposed framework improves clustering quality: MBKEM outperformed other clustering algorithms with an accuracy of 0.88, V-measure (VM) of 0.82, adjusted Rand index (ARI) of 0.87, and Fowlkes-Mallows index (FMI) of 0.94. The average minimum elapsed time for cluster generation across varying k was 0.26 s. In the future, the proposed solution could improve the quality of clustering for heterogeneous categorical data problems in many domains.
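    The abstract does not spell out MBKEM's entropy distance or its update rules, so the following is only a minimal sketch of the general idea: mini-batch, k-modes-style clustering of integer-coded categorical data, with attribute mismatches weighted by each attribute's Shannon entropy. The weighting scheme and the majority-vote centre update are assumptions, not the paper's method.

    import numpy as np

    def attribute_entropies(X):
        # Shannon entropy of each integer-coded categorical column.
        ents = []
        for j in range(X.shape[1]):
            _, counts = np.unique(X[:, j], return_counts=True)
            p = counts / counts.sum()
            ents.append(-(p * np.log2(p)).sum())
        return np.array(ents)

    def entropy_distance(x, modes, w):
        # Mismatch count against each cluster mode, weighted by attribute entropy.
        return ((x != modes) * w).sum(axis=1)

    def mini_batch_kmodes(X, k, batch=64, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        w = attribute_entropies(X)
        modes = X[rng.choice(len(X), k, replace=False)].copy()
        for _ in range(iters):
            B = X[rng.choice(len(X), batch, replace=False)]
            assign = np.array([entropy_distance(x, modes, w).argmin() for x in B])
            for c in range(k):
                members = B[assign == c]
                if len(members) == 0:
                    continue
                for j in range(X.shape[1]):
                    # Move each mode attribute to the batch-local majority category.
                    vals, counts = np.unique(members[:, j], return_counts=True)
                    modes[c, j] = vals[counts.argmax()]
        labels = np.array([entropy_distance(x, modes, w).argmin() for x in X])
        return labels, modes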

    Significance-Based Categorical Data Clustering

    Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this void, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical p-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the cluster number. Extensive experimental studies suggest that our method achieves performance comparable to state-of-the-art categorical data clustering algorithms. Moreover, the effectiveness of this significance-based formulation for statistical cluster validation and cluster number estimation is demonstrated through comprehensive empirical results.
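    The paper's derivation of the test statistic is not reproduced in this abstract, so the sketch below uses an assumed stand-in: twice the log-likelihood gap between a per-cluster multinomial model and a single global cluster, maximised by a naive Monte Carlo reassignment search. It illustrates the shape of a significance-based objective, not the paper's exact statistic or search procedure.

    import numpy as np

    def loglik(X, labels, k):
        # Multinomial log-likelihood of integer-coded categorical data
        # under Laplace-smoothed per-cluster category frequencies.
        ll = 0.0
        for c in range(k):
            C = X[labels == c]
            if len(C) == 0:
                continue
            for j in range(X.shape[1]):
                vals, counts = np.unique(C[:, j], return_counts=True)
                p = (counts + 1) / (counts.sum() + len(vals))
                ll += (counts * np.log(p)).sum()
        return ll

    def lr_statistic(X, labels, k):
        # Clustered model vs. the one-cluster null, likelihood-ratio style.
        return 2 * (loglik(X, labels, k) - loglik(X, np.zeros(len(X), int), 1))

    def monte_carlo_search(X, k, iters=2000, seed=0):
        # Propose single-point label changes; keep those that raise the statistic.
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, k, len(X))
        best = lr_statistic(X, labels, k)
        for _ in range(iters):
            i = rng.integers(len(X))
            old = labels[i]
            labels[i] = rng.integers(0, k)
            cand = lr_statistic(X, labels, k)  # full recompute; fine for small data
            if cand > best:
                best = cand
            else:
                labels[i] = old
        return labels, best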

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, model training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, the thesis advances preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, and a novel imbalanced resampling approach, minority pattern reconstruction (MPR), guided by information theory. The thesis also extends model performance evaluation with a novel classification performance ranking metric called XDistance. The experimental results show that building predictive models with the methods guided by the new framework (Octopus) yields domain experts' approval of the new models' reliable performance. Performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies produced performances better aligned with experts' success criteria than traditional imbalanced data resampling techniques. Finally, the XDistance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework and produce new reliable classifiers; this work also offers a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, new methods for data quality assessment, imputation, resampling, and performance ranking were developed within the framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and toxicity side effects of advanced breast cancer radiotherapy. These contributions could guide future theories, experiments, and healthcare interventions in preventive medicine and data mining.
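    The MMI process itself is not detailed in this abstract; purely as an illustration of the multimethod idea, the sketch below hides a fraction of the observed entries, scores several candidate imputers on them, and keeps the best. The candidate set, the masking scheme, and the MAE criterion are assumptions, not the thesis's procedure.

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    def select_imputer(X, candidates, mask_frac=0.1, seed=0):
        # Assumes every column retains at least one observed value after masking.
        rng = np.random.default_rng(seed)
        observed = ~np.isnan(X)
        holdout = observed & (rng.random(X.shape) < mask_frac)
        X_masked = X.copy()
        X_masked[holdout] = np.nan
        best, best_mae = None, np.inf
        for imp in candidates:
            X_hat = imp.fit_transform(X_masked)
            mae = np.abs(X_hat[holdout] - X[holdout]).mean()
            if mae < best_mae:
                best, best_mae = imp, mae
        # Refit the winning imputer on the data with only its real missing values.
        return best.fit_transform(X), best_mae

    candidates = [SimpleImputer(strategy="mean"),
                  SimpleImputer(strategy="median"),
                  KNNImputer(n_neighbors=5)]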

    New methods for discovering local behaviour in mixed databases

    Clustering techniques are widely used: many applications call for automatically finding groups or hidden information in a data set. Among these applications is building a model of a system from the integration of several local models. A local model can take many structures, but a linear structure is the most common owing to its simplicity. This work aims at improvements in several fields, all of which are applied to finding a set of local models in a database. On the one hand, a way of codifying categorical information into numerical values has been designed, so that a numerical algorithm can be applied to the whole data set. On the other hand, a cost index has been developed and is optimized globally to find the parameters of the local clusters that best define the output of the process. Each of the techniques has been applied to several experiments, and the results show improvements over existing techniques. Barceló Rico, F. (2009). New methods for discovering local behaviour in mixed databases. http://hdl.handle.net/10251/12739
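    The thesis's actual codification scheme is not given in this abstract; one plausible sketch, since the cost index is tied to the process output, is target encoding, which maps each category to the mean of a numeric output column so that a purely numerical algorithm can run on the whole mixed data set. The column names below are hypothetical.

    import pandas as pd

    def encode_mixed(df, target):
        # Replace each categorical column by the per-category mean of the target,
        # leaving numeric columns untouched.
        out = df.copy()
        for col in out.select_dtypes(include="object"):
            out[col] = out[col].map(out.groupby(col)[target].mean())
        return out

    df = pd.DataFrame({"colour": ["red", "blue", "red", "blue"],
                       "y": [1.0, 3.0, 2.0, 4.0]})
    encoded = encode_mixed(df, "y")  # colour -> red: 1.5, blue: 3.5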

    Delivering Reliable AI to Clinical Contexts: Addressing the Challenge of Missing Data

    Clinical data are essential in the medical domain, ensuring quality of care and improving decision-making. However, their heterogeneous and incomplete nature leads to a ubiquity of data quality problems, particularly missing values. Inevitable challenges arise in delivering reliable Decision Support Systems (DSSs), as missing data have negative effects on the learning process of machine learning models. Interest in developing missing value imputation strategies has been growing in an endeavour to overcome this issue. This dissertation aimed to study missing data and their relationships with observed values, and to later employ that information in a technique that addresses the predicaments posed by incomplete datasets in real-world scenarios. Moreover, the concept of correlation was explored within the context of missing value imputation, a promising but rather overlooked approach in biomedical research. First, a comprehensive correlational study was performed, considering key aspects of missing data analysis. Afterwards, the gathered knowledge was leveraged to create three novel correlation-based imputation techniques. These were validated not only on datasets with controlled, synthetic missingness, but also on real-world medical datasets. Their performance was evaluated against competing imputation methods, both traditional and state-of-the-art. The contributions of this dissertation encompass a systematic view of theoretical concepts regarding the analysis and handling of missing values. Additionally, an extensive literature review concerning missing data imputation was conducted, comprising a comparative study of ten methods under diverse missingness conditions. The proposed techniques exhibited results similar to those of their competitors, sometimes even superior in terms of imputation precision and classification performance, evaluated through the Mean Absolute Error and the Area Under the Receiver Operating Characteristic curve, respectively. Therefore, this dissertation corroborates the potential of correlation to improve the robustness of DSSs to missing values, and provides answers to current flaws shared by correlation-based imputation strategies in real-world medical problems.
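    The dissertation's three correlation-based techniques are not specified in this abstract; the sketch below is only a minimal correlation-guided baseline under that caveat: each incomplete column is filled from its most correlated other column via a line fit on jointly observed rows, falling back to the column mean.

    import numpy as np

    def correlation_impute(X):
        X = X.astype(float).copy()
        d = X.shape[1]
        for j in range(d):
            miss = np.isnan(X[:, j])
            if not miss.any():
                continue
            # Find the column most correlated with column j on shared rows.
            best_r, best_k = 0.0, None
            for k in range(d):
                if k == j:
                    continue
                both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
                if both.sum() < 3:
                    continue
                r = abs(np.corrcoef(X[both, j], X[both, k])[0, 1])
                if not np.isnan(r) and r > best_r:
                    best_r, best_k = r, k
            if best_k is None:
                X[miss, j] = np.nanmean(X[:, j])
                continue
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, best_k])
            slope, intercept = np.polyfit(X[both, best_k], X[both, j], 1)
            fillable = miss & ~np.isnan(X[:, best_k])
            X[fillable, j] = slope * X[fillable, best_k] + intercept
            X[miss & ~fillable, j] = np.nanmean(X[:, j])  # predictor also missing
        return X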