The k-means algorithm: A comprehensive survey and performance evaluation
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. The k-means clustering algorithm is considered one of the most popular and widely used data mining algorithms in the research community. Despite its popularity, the algorithm has certain limitations, including its sensitivity to the random initialization of the centroids, which can lead to inconsistent convergence. Additionally, the algorithm requires the number of clusters to be defined beforehand, and it is sensitive to cluster shape and to outliers. A further fundamental limitation of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithm, including their recent developments, are discussed, and their effectiveness is investigated through experimental analysis on a variety of datasets. The detailed experimental analysis, along with a thorough comparison among different k-means clustering algorithms, differentiates our work from existing survey papers. Furthermore, the paper outlines a clear and thorough understanding of the k-means algorithm along with its different research directions.
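As a point of reference for the shortcomings discussed above, Lloyd's classic k-means iteration can be sketched in a few lines. The random centroid initialization below is exactly the step the survey identifies as a source of inconsistent convergence; the data and parameter names are illustrative only, not the survey's experimental setup:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's k-means. Random initialization is the classic weak point:
    different seeds can converge to different local optima."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```

Variants such as k-means++ replace the random initialization line with a spread-out seeding scheme, which is one of the remedies the survey covers.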
Analysis of Dimensionality Reduction Techniques on Big Data
Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, the Web, and organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data, and hence can be used to make predictions that help medical practitioners and people at the managerial level make executive decisions. Not all the attributes in the generated datasets are important for training machine learning algorithms: some attributes might be irrelevant, and some might not affect the outcome of the prediction. Ignoring or removing these irrelevant or less important attributes reduces the burden on machine learning algorithms. In this work, two prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated on four popular Machine Learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier, and Random Forest Classifier, using the publicly available Cardiotocography (CTG) dataset from the University of California, Irvine (UCI) Machine Learning Repository. The experimental results show that PCA outperforms LDA in all the measures. Also, the performance of the Decision Tree and Random Forest classifiers is not affected much by using PCA and LDA. To further analyze the performance of PCA and LDA, the experimentation is carried out on Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. The results show that ML algorithms with PCA produce better results when the dimensionality of the datasets is high. When the dimensionality of the datasets is low, it is observed that the ML algorithms without dimensionality reduction yield better results.
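For context on what such a reduction step does, PCA projects the centred data onto the leading eigenvectors of its covariance matrix, keeping the directions of highest variance. A minimal sketch (this is the general technique, not the paper's experimental pipeline; data shapes are illustrative):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components."""
    Xc = X - X.mean(axis=0)            # centre each feature
    cov = np.cov(Xc, rowvar=False)     # feature-by-feature covariance
    vals, vecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    top = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, top]           # coordinates in the reduced space
```

In practice one would use `sklearn.decomposition.PCA` and `sklearn.discriminant_analysis.LinearDiscriminantAnalysis`, which is presumably closer to the setup the paper evaluates; LDA differs in that it uses class labels to choose discriminative directions.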
Fuzzy Distance Measure Based Affinity Propagation Clustering
Affinity Propagation (AP) is an effective clustering algorithm that finds exemplars by repeatedly exchanging real-valued messages between pairs of data points. AP uses the similarity between data points to compute these messages, so the construction of the similarity measure is essential to the algorithm. A common choice is the negative Euclidean distance. However, owing to its simplicity, Euclidean distance cannot capture the real structure of the data; it is also sensitive to noise and outliers, which can degrade the performance of AP. Researchers have therefore investigated different similarity measures to improve the performance of AP. Nonetheless, there is still room to enhance the performance of AP clustering. A clustering method called fuzzy-based Affinity Propagation (F-AP) is proposed, which is based on a fuzzy similarity measure. Experiments performed on UCI datasets show the efficiency of the proposed F-AP, with results showing a promising improvement over AP.
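To make the message-passing loop concrete, the sketch below implements standard Affinity Propagation with the common negative squared Euclidean similarity and a median preference. It does not include the fuzzy similarity measure that F-AP contributes; the data and parameter values are illustrative only:

```python
import numpy as np

def affinity_propagation(X, damping=0.9, iters=200):
    """Standard AP (Frey & Dueck): responsibility/availability messages."""
    n = len(X)
    # similarity: negative squared Euclidean distance
    S = -np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    # preference (diagonal) controls how many exemplars emerge
    S[np.diag_indices(n)] = np.median(S[~np.eye(n, dtype=bool)])
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(iters):
        # responsibility: r(i,k) = s(i,k) - max_{k'!=k}(a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # availability: a(i,k) = min(0, r(k,k) + sum of positive r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = np.diag(Anew).copy()
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(len(exemplars))
    return exemplars, labels
```

A fuzzy variant along the lines of F-AP would replace the similarity matrix `S` with one derived from a fuzzy distance measure; the message updates themselves stay unchanged, which is what makes the similarity construction the natural place to intervene.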
The Categorical Data Conundrum: Heuristics for Classification Problems. A Case Study on Domestic Fire Injuries
Machine learning is well developed amongst the scientific community in terms of theoretical foundations (statistics and algorithms) and frameworks (TensorFlow, PyTorch, H2O). However, machine learning is heavily focused on numerical data, or numerical data mixed with some categorical data. For numerical datasets, scientists and engineers can enjoy reasonable success with only a limited knowledge of theoretical foundations and the inner workings of machine learning frameworks. It is a different story, however, when dealing with purely categorical datasets, which require a deeper understanding of machine learning frameworks and the associated encodings and algorithms in order to achieve success. This paper addresses the issues in handling purely categorical datasets for multi-classification problems and provides a set of heuristics for dealing with purely categorical data. In particular, issues such as pre-processing, feature encoding, and algorithm selection are considered. The heuristics are then demonstrated through a case study, based on a categorical dataset of domestic fire injuries covering a 10-year period. Novel contributions are made through the heuristics and the performance analysis of different encoding techniques. The case study itself also makes a novel contribution through the classification of different types of injuries based on related features.
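One of the simplest encoding heuristics for purely categorical features is one-hot (indicator) encoding, which turns each category of a feature into its own binary column so that no artificial ordering is imposed. A minimal sketch follows; the `source` feature and its values are hypothetical, not taken from the paper's fire-injury dataset:

```python
def one_hot_encode(records, feature):
    """Expand one categorical feature into binary indicator columns.

    records: list of dicts, each mapping feature names to category values.
    Returns one dict of 0/1 indicators per record.
    """
    categories = sorted({r[feature] for r in records})
    return [{f"{feature}={c}": int(r[feature] == c) for c in categories}
            for r in records]
```

One-hot encoding is only one point in the design space: ordinal encoding is more compact but injects a spurious order, and high-cardinality features can blow up the column count, which is exactly the kind of trade-off encoding heuristics have to weigh.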
Machine learning techniques implementation in power optimization, data processing, and bio-medical applications
The rapid progress in machine-learning algorithms has become a key factor in determining the future of humanity. These algorithms and techniques have been utilized to solve a wide spectrum of problems, ranging from data mining and knowledge discovery to unsupervised learning and optimization. This dissertation consists of two study areas. The first investigates the use of reinforcement learning and adaptive critic design algorithms in the field of power grid control. The second, consisting of three papers, focuses on developing and applying clustering algorithms to biomedical data. The first paper presents a novel modelling approach for demand-side management of electric water heaters using Q-learning and action-dependent heuristic dynamic programming. The implemented approaches provide an efficient load management mechanism that reduces the overall power cost and smooths the grid load profile. The second paper implements an ensemble statistical and subspace-clustering model for analyzing the heterogeneous data of autism spectrum disorder. The paper implements a novel k-dimensional algorithm that handles heterogeneous datasets efficiently. The third paper provides a unified learning model for clustering neuroimaging data to identify potential risk factors for suboptimal brain aging. In the last paper, clustering and clustering validation indices are utilized to identify the groups of compounds that are responsible for plant uptake and contaminant transport from roots to plants' edible parts --Abstract, page iv
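To illustrate the reinforcement-learning side in miniature, the sketch below applies tabular Q-learning to a toy water-heater scheduling task: heating costs the current electricity price, and having hot water by the end of the day earns a reward, so the agent learns to shift the load to off-peak hours. The tariff, state encoding, and reward are invented for illustration and are not the dissertation's model:

```python
import random

# Made-up tariff: peak in hours 0-1, off-peak in hours 2-3.
PRICES = [0.9, 0.9, 0.1, 0.1]

def heater_step(s, a):
    """Toy environment. State s encodes (hour, already_heated) as
    hour*2 + heated; action 1 = heat now, action 0 = wait."""
    hour, heated = divmod(s, 2)
    r = -PRICES[hour] if a == 1 else 0.0   # pay the tariff when heating
    heated = heated or a
    hour += 1
    done = hour == len(PRICES)
    if done and heated:
        r += 1.0                           # hot water available by day's end
    return hour * 2 + heated, r, done

def q_learning(n_states, n_actions, step, episodes=2000,
               alpha=0.1, gamma=0.95, eps=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:
                a = rng.randrange(n_actions)       # explore
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])  # exploit
            s2, r, done = step(s, a)
            target = r + gamma * max(Q[s2]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])  # TD update
            s = s2
    return Q
```

The learned greedy policy avoids heating during the expensive peak hours and heats during the cheap ones, which is the load-shifting behaviour demand-side management aims for; the adaptive critic designs mentioned in the abstract pursue the same goal with function approximation instead of a table.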
Scheduling and Resource Efficiency Balancing. Discrete Species Conserving Cuckoo Search for Scheduling in an Uncertain Execution Environment
The main goal of a scheduling process is to decide when and how to execute each of the project’s activities. Despite the large variety of scheduling problems studied, the majority of them can be described as generalisations of the resource-constrained project scheduling problem (RCPSP). Because of its wide applicability and challenging difficulty, the RCPSP has attracted a vast amount of attention in the research community, and a great variety of heuristics has been adapted for solving it. Even though these heuristics are structurally different and operate according to diverse principles, they are designed to obtain only one solution at a time. Recent research on RCPSPs has shown that these kinds of problems have complex multimodal fitness landscapes, characterised by wide solution search spaces and the presence of multiple local and global optima.
The main goal of this thesis is twofold. Firstly, it presents a variation of the RCPSP that considers the optimisation of projects in an uncertain environment, where resources are modelled to adapt to their environment and, as a result, improve their efficiency. Secondly, a modification of the novel evolutionary computation method Cuckoo Search (CS) is proposed, which has been adapted for solving combinatorial optimisation problems and modified to obtain multiple solutions. To test the proposed methodology, two sets of experiments are carried out. Firstly, the developed algorithm is applied to a real-life software development project. Secondly, the performance of the algorithm is tested on universal benchmark instances for scheduling problems, modified to take into account the specifics of the proposed optimisation model. The results of both experiments demonstrate that the proposed methodology achieves a competitive level of performance and is capable of finding multiple global solutions, and they prove its applicability to real-life projects.
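For readers unfamiliar with CS, the sketch below implements the standard continuous Cuckoo Search of Yang and Deb (Lévy-flight perturbation toward the best nest, random replacement, abandonment of the worst nests) on a toy sphere function. It is not the discrete species-conserving variant developed in the thesis, and all parameter values are illustrative:

```python
import math
import random

def levy_step(rng, beta=1.5):
    """Heavy-tailed step length via Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
             (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)

def cuckoo_search(f, dim, n=15, iters=300, pa=0.25, bound=5.0, seed=1):
    """Minimise f over [-bound, bound]^dim with standard Cuckoo Search."""
    rng = random.Random(seed)
    nests = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n)]
    fit = [f(x) for x in nests]
    best = min(range(n), key=fit.__getitem__)
    for _ in range(iters):
        for i in range(n):
            # new cuckoo egg: Levy flight scaled by distance to the best nest
            step = 0.01 * levy_step(rng)
            new = [min(bound, max(-bound,
                       nests[i][d] + step * (nests[i][d] - nests[best][d])))
                   for d in range(dim)]
            fn = f(new)
            j = rng.randrange(n)          # compare against a random nest
            if fn < fit[j]:
                nests[j], fit[j] = new, fn
        # abandon a fraction pa of the worst nests (host bird discovers egg)
        worst_first = sorted(range(n), key=fit.__getitem__, reverse=True)
        for i in worst_first[:int(pa * n)]:
            nests[i] = [rng.uniform(-bound, bound) for _ in range(dim)]
            fit[i] = f(nests[i])
        best = min(range(n), key=fit.__getitem__)
    return nests[best], fit[best]
```

A discrete, multi-solution adaptation such as the one proposed in the thesis would replace the real-valued Lévy perturbation with moves over activity orderings and add a niching (species-conserving) mechanism to retain several distinct optima, rather than tracking the single best nest.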