33 research outputs found

    Application of Synthetic Informative Minority Over-Sampling (SIMO) Algorithm Leveraging Support Vector Machine (SVM) On Small Datasets with Class Imbalance

    Developing predictive models for classification problems on imbalanced datasets is one of the basic difficulties in data mining and decision analytics. A classifier’s performance can decline dramatically when it is applied to an imbalanced dataset. Standard classifiers such as logistic regression and Support Vector Machine (SVM) are appropriate for balanced training sets but provide suboptimal classification results when used on unbalanced datasets. Using prediction accuracy as the performance metric encourages a bias towards the majority class, so the rare instances remain undetected even though the model reports a high overall precision. Minority instances may also be treated as noise, and vice versa (Haixiang et al., 2017). A wide range of class-imbalance learning techniques has been introduced to overcome these problems, each with its own advantages and shortcomings. This paper details the behavior of a novel imbalanced learning technique, the Synthetic Informative Minority Over-Sampling (SIMO) algorithm leveraging Support Vector Machine (SVM), on small datasets of fewer than 200 records. The base classifiers, logistic regression and SVM, are used to validate the impact of SIMO on classifier performance in terms of the metrics G-mean and Area Under Curve. SIMO is compared with the algorithms SMOTE, Borderline-SMOTE, and ADASYN to evaluate its performance relative to the others.
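
    The accuracy bias described in the abstract above can be illustrated with a minimal sketch. The toy labels and the helper names (`confusion_counts`, `accuracy`, `g_mean`) are assumptions made for illustration and are not taken from the paper; G-mean is computed here as the square root of sensitivity times specificity, which is its usual binary definition.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy data: 95 majority (0) and 5 minority (1) instances,
# scored against a classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # accuracy is 0.95
print(g_mean(y_true, y_pred))    # G-mean is 0.0
```

    On this toy data the always-majority classifier looks strong by accuracy while G-mean drops to zero, because no minority instance is recovered.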

    An advance extended binomial GLMBoost ensemble method with synthetic minority over-sampling technique for handling imbalanced datasets

    Classification is an important activity in a variety of domains. Class imbalance problems have reduced the performance of traditional classification approaches. An imbalance problem arises when the class distributions among the instances of a classification dataset are mismatched. An advanced extended binomial GLMBoost (EBGLMBoost) coupled with the synthetic minority over-sampling technique (SMOTE) is the model proposed in this study to manage imbalance issues. SMOTE is applied in the proposed model to ensure that the target variable's distribution is balanced, whereas the GLMBoost ensemble techniques are built to deal with imbalanced datasets. For the entire experiment, twenty different datasets are used, and support vector machine (SVM), Nu-SVM, bagging, and AdaBoost classification algorithms are compared with the suggested method. The model's sensitivity, specificity, geometric mean (G-mean), precision, recall, and F-measure, in percent, are 99.37, 66.95, 80.81, 99.21, 99.37, and 99.29 on the training datasets and 98.61, 54.78, 69.88, 98.77, 96.61, and 98.68 on the testing datasets, respectively. With the help of the Wilcoxon test, it is determined that the proposed technique performed well on unbalanced data. Finally, the proposed solutions are capable of efficiently dealing with the problem of class imbalance.

    Improved adaptive semi-unsupervised weighted oversampling (IA-SUWO) using sparsity factor for imbalanced datasets

    The imbalanced data problem is common in data mining nowadays due to the skewed nature of data, which negatively impacts the classification process in machine learning. As a preprocessing step, oversampling techniques have significantly benefitted the imbalanced domain: artificial data is generated in the minority class to increase the number of samples and balance the distribution of samples across both classes. However, existing oversampling techniques suffer from overfitting and over-generalization problems, which lessen classifier performance. Although many clustering-based oversampling techniques largely overcome these problems, most of them are not able to produce the appropriate number of synthetic samples in minority clusters. This study proposes an improved Adaptive Semi-unsupervised Weighted Oversampling (IA-SUWO) technique that uses a sparsity factor to determine the sparse minority samples in each minority cluster. The technique considers the sparse minority samples that are far from the decision boundary. These samples also carry important information for learning the minority class; if they are also considered for oversampling, the imbalance ratio is reduced further and the learnability of the classifiers could be enhanced. The outcomes of the proposed approach have been compared with existing oversampling techniques such as SMOTE, Borderline-SMOTE, Safe-level SMOTE, and the standard A-SUWO technique in terms of accuracy. The comparative analysis revealed that the accuracy of the proposed oversampling approach increased on average by 5 percentage points, from 85% to 90%, over the existing techniques.
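
    The general idea of generating artificial minority samples, as mentioned in the abstract above, can be sketched with plain SMOTE-style interpolation. This is not IA-SUWO (it has no clustering, weighting, or sparsity factor); the function name `smote_like_oversample`, its parameters, and the toy points are assumptions made for illustration.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority class with four points.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority_points, n_new=6)
print(len(new_points))
```

    Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies, which is one reason plain interpolation can over-generalize near class boundaries.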

    Customer churn prediction in the telecommunications industry using data mining methods

    At present, in the competitive space between companies and organizations, customer churn is the most important challenge. When a customer churns, the organization loses one of its most important assets, which can lead to financial losses and even bankruptcy. Customer churn prediction using data mining techniques can alleviate these problems to some extent. The aim of the present study is to provide a hybrid method based on a Genetic Algorithm and a Modular Neural Network for customer churn prediction in the telecommunication industry, using Irancell data as a sample. The accuracy obtained in this study, 95.5%, ranks highest in comparison with the results of other methods. This shows that using a modular neural network with two feedforward neural network modules, and using a genetic algorithm to obtain the optimal structure for the modules of the neural network, are the most important factors that allow this method to reach the highest accuracy among the compared methods.

    A Comparison of Re-Sampling Techniques for Detection of Multi-Step Attacks on Deep Learning Models

    The increasing dependence on data analytics and artificial intelligence (AI) methodologies across various domains has raised concerns over data security and integrity. There is a consensus among scholars and experts that the identification and mitigation of Multi-step attacks pose significant challenges due to the intricate nature of the diverse approaches utilized. This study aims to address the issue of imbalanced datasets within the domain of Multi-step attack detection. To achieve this objective, the research explores three distinct re-sampling strategies, namely over-sampling, under-sampling, and hybrid re-sampling techniques. The study offers a comprehensive assessment of several re-sampling techniques utilized in the detection of Multi-step attacks on deep learning (DL) models. The efficacy of the solution is evaluated using a Multi-step cyber attack dataset that emulates attacks across six attack classes. Furthermore, the performance of several re-sampling approaches with numerous traditional machine learning (ML) and deep learning (DL) models is compared, based on performance metrics such as accuracy, precision, recall, F-1 score, and G-mean. In contrast to preliminary studies, the research focuses on Multi-step attack detection. The results indicate that the combination of Convolutional Neural Networks (CNN) with Deep Belief Networks (DBN), Long Short-Term Memory (LSTM), and Recurrent Neural Networks (RNN) provides optimal results as compared to standalone ML/DL models. Moreover, the results also show that SMOTEENN, a hybrid re-sampling technique, demonstrates superior effectiveness in enhancing detection performance across various models and evaluation metrics. The findings indicate the significance of appropriate re-sampling techniques to improve the efficacy of Multi-step attack detection on DL models.
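
    The three re-sampling strategies named in the abstract above can be sketched with simple random re-sampling. This is a deliberately simpler stand-in and not SMOTEENN (which combines SMOTE with edited nearest neighbours); the function `random_resample`, the class names, and the counts are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def random_resample(samples, labels, target, seed=0):
    """Resample every class to exactly `target` items: classes larger than
    target are under-sampled, smaller classes are over-sampled."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    out_x, out_y = [], []
    for y, items in by_class.items():
        if len(items) >= target:
            # Under-sampling: keep a random subset without replacement.
            chosen = rng.sample(items, target)
        else:
            # Over-sampling: keep all items and add random duplicates.
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y

# Hypothetical multi-class data: sample ids with three imbalanced classes.
labels = ["normal"] * 50 + ["scan"] * 10 + ["exploit"] * 4
samples = list(range(len(labels)))
counts = Counter(labels)

over_x, over_y = random_resample(samples, labels, max(counts.values()))
under_x, under_y = random_resample(samples, labels, min(counts.values()))
hybrid_x, hybrid_y = random_resample(samples, labels, 10)
print(Counter(over_y), Counter(under_y), Counter(hybrid_y))
```

    Setting the target to the largest class size gives pure over-sampling, the smallest gives pure under-sampling, and anything in between acts as a simple hybrid that shrinks large classes while growing small ones.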

    Breast cancer classification using machine learning techniques: a comparative study

    Background: Breast cancer is the second deadliest disease affecting women worldwide, after lung cancer. Traditional approaches to breast cancer diagnosis are time-consuming and subject to human errors in classification. To deal with these problems, many research works based on machine learning techniques have been proposed. These approaches have shown their effectiveness in data classification in many fields, especially in healthcare. Methods: In this cross-sectional study, we conducted a practical comparison between the machine learning algorithms most used in the literature. We applied kernel and linear support vector machines, random forest, decision tree, multi-layer perceptron, logistic regression, and k-nearest neighbors to breast cancer tumor classification. The dataset used is the Wisconsin Diagnostic Breast Cancer dataset. Results: After comparing the efficiency of the machine learning algorithms, we found that multilayer perceptron and logistic regression gave the best results, with an accuracy of 98% for breast cancer classification. Conclusion: Machine learning approaches are extensively used in medical prediction and decision support systems. This study showed that the multilayer perceptron and logistic regression algorithms perform well (good accuracy, specificity, and sensitivity) compared to the other evaluated algorithms.

    Effective algorithms to predict customer churn in financial services

    Abstract: Please refer to full text to view abstract. M.Eng. (Electrical and Electronic Engineering Science)

    Artificial Intelligence in Organisation and Managerial Studies: A Computational Literature Review

    The goal of this paper is to develop a complete overview of the current debate on artificial intelligence in organisation and managerial studies. To this end, we adopted the Computational Literature Review (CLR) method to conduct an impact analysis and a topic modelling analysis of the relevant literature, using the Latent Dirichlet Allocation (LDA) technique. As a result, we identified 15 topics concerning the artificial intelligence debate in organisation studies, providing a detailed description of each of them and identifying which are declining, stable, or emerging. We also recognized two main branches of research, regarding technical and societal aspects, of which the latter has become increasingly important in recent years. Finally, focusing on the emerging topics, we proposed a set of guiding questions that might foster future research directions. This paper provides insights to scholars and managers interested in AI and could also be used as a guide to performing a CLR.