427 research outputs found

    The k-means algorithm: A comprehensive survey and performance evaluation

    Get PDF
    © 2020 by the authors. Licensee MDPI, Basel, Switzerland. The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids which leads to unexpected convergence. Additionally, such a clustering algorithm requires the number of clusters to be defined beforehand, which is responsible for different cluster shapes and outlier effects. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithms including their recent developments are discussed, where their effectiveness is investigated based on the experimental analysis of a variety of datasets. The detailed experimental analysis along with a thorough comparison among different k-means clustering algorithms differentiates our work compared to other existing survey papers. Furthermore, it outlines a clear and thorough understanding of the k-means algorithm along with its different research directions

    An AI-based Intelligent System for Healthcare Analysis Using Ridge–Adaline Stochastic Gradient Descent Classifier

    Get PDF
    Recent technological advancements in information and communication technologies introduced smart ways of handling various aspects of life. Smart devices and applications are now an integral part of our daily life; however, the use of smart devices also introduced various physical and psychological health issues in modern societies. One of the most common health care issues prevalent among almost all age groups is diabetes mellitus. This work aims to propose an Artificial Intelligence (AI) – based intelligent system for earlier prediction of the disease using Ridge Adaline Stochastic Gradient Descent Classifier (RASGD). The proposed scheme RASGD improves the regularization of the classification model by using weight decay methods, namely Least Absolute Shrinkage and Selection Operator(LASSO) and Ridge Regression methods. To minimize the cost function of the classifier, the RASGD adopts an unconstrained optimization model. Further, to increase the convergence speed of the classifier, the Adaline Stochastic Gradient Descent classifier is integrated with Ridge Regression. Finally, to validate the effectiveness of the intelligent system, the results of the proposed scheme have been compared with state-of-art machine learning algorithms such as Support Vector Machine and Logistic Regression methods. The RASGD intelligent system attains an accuracy of 92%, which is better than the other selected classifiers

    Hybrid Software Reliability Model for Big Fault Data and Selection of Best Optimizer Using an Estimation Accuracy Function

    Get PDF
    Software reliability analysis has come to the forefront of academia as software applications have grown in size and complexity. Traditionally, methods have focused on minimizing coding errors to guarantee analytic tractability. This causes the estimations to be overly optimistic when using these models. However, it is important to take into account non-software factors, such as human error and hardware failure, in addition to software faults to get reliable estimations. In this research, we examine how big data systems' peculiarities and the need for specialized hardware led to the creation of a hybrid model. We used statistical and soft computing approaches to determine values for the model's parameters, and we explored five criteria values in an effort to identify the most useful method of parameter evaluation for big data systems. For this purpose, we conduct a case study analysis of software failure data from four actual projects. In order to do a comparison, we used the precision of the estimation function for the results. Particle swarm optimization was shown to be the most effective optimization method for the hybrid model constructed with the use of large-scale fault data

    Analysis of physiological signals using machine learning methods

    Get PDF
    Technological advances in data collection enable scientists to suggest novel approaches, such as Machine Learning algorithms, to process and make sense of this information. However, during this process of collection, data loss and damage can occur for reasons such as faulty device sensors or miscommunication. In the context of time-series data such as multi-channel bio-signals, there is a possibility of losing a whole channel. In such cases, existing research suggests imputing the missing parts when the majority of data is available. One way of understanding and classifying complex signals is by using deep neural networks. The hyper-parameters of such models have been optimised using the process of back propagation. Over time, improvements have been suggested to enhance this algorithm. However, an essential drawback of the back propagation can be the sensitivity to noisy data. This thesis proposes two novel approaches to address the missing data challenge and back propagation drawbacks: First, suggesting a gradient-free model in order to discover the optimal hyper-parameters of a deep neural network. The complexity of deep networks and high-dimensional optimisation parameters presents challenges to find a suitable network structure and hyper-parameter configuration. This thesis proposes the use of a minimalist swarm optimiser, Dispersive Flies Optimisation(DFO), to enable the selected model to achieve better results in comparison with the traditional back propagation algorithm in certain conditions such as limited number of training samples. The DFO algorithm offers a robust search process for finding and determining the hyper-parameter configurations. Second, imputing whole missing bio-signals within a multi-channel sample. This approach comprises two experiments, namely the two-signal and five-signal imputation models. The first experiment attempts to implement and evaluate the performance of a model mapping bio-signals from A toB and vice versa. Conceptually, this is an extension to transfer learning using CycleGenerative Adversarial Networks (CycleGANs). The second experiment attempts to suggest a mechanism imputing missing signals in instances where multiple data channels are available for each sample. The capability to map to a target signal through multiple source domains achieves a more accurate estimate for the target domain. The results of the experiments performed indicate that in certain circumstances, such as having a limited number of samples, finding the optimal hyper-parameters of a neural network using gradient-free algorithms outperforms traditional gradient-based algorithms, leading to more accurate classification results. In addition, Generative Adversarial Networks could be used to impute the missing data channels in multi-channel bio-signals, and the generated data used for further analysis and classification tasks

    Performance evaluation of random forest with feature selection methods in prediction of diabetes

    Get PDF
    Data mining is nothing but the process of viewing data in different angle and compiling it into appropriate information. Recent improvements in the area of data mining and machine learning have empowered the research in biomedical field to improve the condition of general health care. Since the wrong classification may lead to poor prediction, there is a need to perform the better classification which further improves the prediction rate of the medical datasets. When medical data mining is applied on the medical datasets the important and difficult challenges are the classification and prediction. In this proposed work we evaluate the PIMA Indian Diabtes data set of UCI repository using machine learning algorithm like Random Forest along with feature selection methods such as forward selection and backward elimination based on entropy evaluation method using percentage split as test option. The experiment was conducted using R studio platform and we achieved classification accuracy of 84.1%. From results we can say that Random Forest predicts diabetes better than other techniques with less number of attributes so that one can avoid least important test for identifying diabetes

    Crafting Adversarial Examples using Particle Swarm Optimization

    Get PDF
    Machine learning models have been found to be vulnerable to adversarial attacks that apply small perturbations to input samples to get them misclassified. Attacks that search for and apply the perturbations are performed in both white-box and black-box settings, depending on the information available to the attacker about the target. For black-box attacks, the attacker can only query the target with specially crafted inputs and observing the outputs returned by the model. These outputs are used to guide the perturbations and create adversarial examples that are then misclassified. Current black-box attacks on API-based malware classifiers rely solely on feature insertion when applying perturbations. This restriction is set in place to ensure that no changes are introduced to the malware\u27s originally intended functionality. Additionally, the API calls being inserted in the malware are null or no-op APIs that have no functional affect to avoid any unintentional impact on malware behavior. Due to the nature of these API calls, they can be easily detected through non-ML techniques by analyzing their arguments and return values. In this dissertation, we explore other attacks on API-based malware detection models that are not restricted to feature addition. Specifically, we explore feature replacement as a possible avenue for creating adversarial malware examples. To retain the malware\u27s original functionality, we replace API calls with other functionally equivalent API calls. We find the API alternatives by using a hierarchical unsupervised learning approach on the API\u27s documentation. Our attack, which we call AdversarialPSO, uses Particle Swarm Optimization to guide the perturbations according to available function alternatives. Results show that creating adversarial malware examples by feature replacement is possible even under the more restrictive search space of limited function alternatives. Unlike the malware domain, which lacks benchmark datasets and publicly available classification models, image classification has multiple benchmarks to test new attacks. Therefore, to evaluate the efficacy and wide-applicability of AdversarialPSO, we re-implement the attack in the image classification domain, where we create adversarial examples from images by adding small often unrecognizable perturbations to the inputs. As a result of these perturbations, highly-accurate models misclassify the inputs resulting in a drastic drop in their accuracy. We evaluate this attack against both defended and undefended models and show that AdversarialPSO performs comparably to state-of-the-art adversarial attacks
    • …
    corecore