110 research outputs found

    Clustering Mixed Numeric and Categorical Data with Cuckoo Search

    Get PDF

    The k-means algorithm: A comprehensive survey and performance evaluation

    Get PDF
    © 2020 by the authors. Licensee MDPI, Basel, Switzerland. The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids which leads to unexpected convergence. Additionally, such a clustering algorithm requires the number of clusters to be defined beforehand, which is responsible for different cluster shapes and outlier effects. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithms including their recent developments are discussed, where their effectiveness is investigated based on the experimental analysis of a variety of datasets. The detailed experimental analysis along with a thorough comparison among different k-means clustering algorithms differentiates our work compared to other existing survey papers. Furthermore, it outlines a clear and thorough understanding of the k-means algorithm along with its different research directions

    Analysis of Dimensionality Reduction Techniques on Big Data

    Get PDF
    Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, Web, organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data. Hence, they can be used to make predictions that can be used by medical practitioners and people at managerial level to make executive decisions. Not all the attributes in the datasets generated are important for training the machine learning algorithms. Some attributes might be irrelevant and some might not affect the outcome of the prediction. Ignoring or removing these irrelevant or less important attributes reduces the burden on machine learning algorithms. In this work two of the prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are investigated on four popular Machine Learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier and Random Forest Classifier using publicly available Cardiotocography (CTG) dataset from University of California and Irvine Machine Learning Repository. The experimentation results prove that PCA outperforms LDA in all the measures. Also, the performance of the classifiers, Decision Tree, Random Forest examined is not affected much by using PCA and LDA.To further analyze the performance of PCA and LDA the eperimentation is carried out on Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. Experimentation results prove that ML algorithms with PCA produce better results when dimensionality of the datasets is high. When dimensionality of datasets is low it is observed that the ML algorithms without dimensionality reduction yields better results

    Fuzzy Distance Measure Based Affinity Propagation Clustering

    Get PDF
    Affinity Propagation (AP) is an effective algorithm that find exemplars repeatedly exchange real valued messages between pairs of data points. AP uses the similarity between data points to calculate the messages. Hence, the construction of similarity is essential in the AP algorithm. A common choice for similarity is the negative Euclidean distance. However, due to the simplicity of Euclidean distance, it cannot capture the real structure of data. Furthermore, Euclidean distance is sensitive to noise and outliers such that the performance of the AP might be degraded. Therefore, researchers have intended to utilize different similarity measures to analyse the performance of AP. nonetheless, there is still a room to enhance the performance of AP clustering. A clustering method called fuzzy based Affinity propagation (F-AP) is proposed, which is based on a fuzzy similarity measure. Experiments shows the efficiency of the proposed F-AP, experiments is performed on UCI dataset. Results shows a promising improvement on AP

    The Categorical Data Conundrum: Heuristics for Classification Problems A Case Study on Domestic Fire Injuries

    Get PDF
    Machine learning is well developed amongst the scientific community in terms of theoretical foundations (statistics and algorithms) and frameworks (Tensorflow, PyTorch, H2O). However, machine learning is heavily focused on numerical data, or numerical data mixed with some categorical data. For numerical datasets, scientists and engineers can enjoy reasonable success with only a limited knowledge of theoretical foundations and the inner workings of machine learning frameworks. However, it is a different story when dealing with purely categorical datasets, which require a deeper understanding of machine learning frameworks and associated encodings and algorithms in order to achieve success. This paper addresses the issues in handling purely categorical datasets for multi-classification problems and provides a set of heuristics for dealing with purely categorical data. In particular, issues such as pre-processing, feature encoding and algorithm selection are considered. The heuristics are then demonstrated through a case study, based on a categorical data set of domestic fire injuries, covering a 10-year period. Novel contributions are made through the heuristics and the performance analysis of different encoding techniques. The case study itself also makes a novel contribution through the classification of different types of injuries, based on related features

    Machine learning techniques implementation in power optimization, data processing, and bio-medical applications

    Get PDF
    The rapid progress and development in machine-learning algorithms becomes a key factor in determining the future of humanity. These algorithms and techniques were utilized to solve a wide spectrum of problems extended from data mining and knowledge discovery to unsupervised learning and optimization. This dissertation consists of two study areas. The first area investigates the use of reinforcement learning and adaptive critic design algorithms in the field of power grid control. The second area in this dissertation, consisting of three papers, focuses on developing and applying clustering algorithms on biomedical data. The first paper presents a novel modelling approach for demand side management of electric water heaters using Q-learning and action-dependent heuristic dynamic programming. The implemented approaches provide an efficient load management mechanism that reduces the overall power cost and smooths grid load profile. The second paper implements an ensemble statistical and subspace-clustering model for analyzing the heterogeneous data of the autism spectrum disorder. The paper implements a novel k-dimensional algorithm that shows efficiency in handling heterogeneous dataset. The third paper provides a unified learning model for clustering neuroimaging data to identify the potential risk factors for suboptimal brain aging. In the last paper, clustering and clustering validation indices are utilized to identify the groups of compounds that are responsible for plant uptake and contaminant transportation from roots to plants edible parts --Abstract, page iv
    • …
    corecore