
    Multiclass Cancer Classification by Using Fuzzy Support Vector Machine and Binary Decision Tree With Gene Selection

    We investigate multiclass cancer classification with gene selection from gene expression data. We propose two multiclass classifiers with gene selection: a fuzzy support vector machine (FSVM) with gene selection and an SVM-based binary classification tree with gene selection. Using the F test and SVM-based recursive feature elimination (SVM-RFE) as gene selection methods, we test three combinations in our experiments: the binary classification tree with the F test, the binary classification tree with SVM-RFE, and FSVM with SVM-RFE. To accelerate computation, the strongest genes are preselected. The proposed techniques are applied to breast cancer data, small round blue-cell tumor data, and acute leukemia data. Compared with existing multiclass cancer classifiers and with the two binary classification tree variants described in this paper, FSVM with SVM-RFE can find the most important genes that affect certain types of cancer with high recognition accuracy.
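
The abstract above names the F test as one of its gene selection methods but gives no formula. The following is a minimal sketch of one common reading, a one-way ANOVA F statistic computed per gene and used to rank genes. The function names, the toy expression values, and the class labels are illustrative assumptions, not taken from the paper.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene measured across labelled samples."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Variance explained by class membership (between-group mean square).
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()) / (k - 1)
    # Residual variance inside each class (within-group mean square).
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_genes(expression, labels):
    """Return gene indices ordered from highest to lowest F statistic."""
    scores = [f_statistic(gene, labels) for gene in expression]
    return sorted(range(len(expression)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: expression[i] holds gene i over six samples, three classes.
labels = ["A", "A", "B", "B", "C", "C"]
expression = [
    [5.0, 5.2, 5.1, 4.9, 5.0, 5.1],  # gene 0: nearly flat across classes
    [1.0, 1.2, 3.0, 3.1, 5.0, 5.2],  # gene 1: strongly class-dependent
    [2.0, 2.5, 2.2, 3.0, 2.1, 2.9],  # gene 2: noisy, weak class signal
]
ranking = rank_genes(expression, labels)
```

In a preselection step of the kind the abstract mentions, only the top-ranked genes would be passed on to the more expensive wrapper method; how many to keep is a choice the abstract does not specify.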

    Sentiment Analysis on Social Media Via Machine Learning

    Social media are shaping users' attitudes and behaviors by spreading information anytime and anywhere. Monitoring user opinions on social media is an effective way to measure users' preferences towards brands or events. Currently, supervised machine learning-based methods dominate this area. However, as far as we know, there is no comprehensive comparison of different models that shows which model performs better on individual datasets. The focus of this thesis is to compare the performance of different supervised machine learning models. In detail, we built six classifiers (support vector machine, random forest, neural network, AdaBoost, decision tree, and Naive Bayes) on two datasets and compared their performance. Furthermore, we introduced feature selection to remove unrelated attributes as a preprocessing step and compared the performance of classifiers built on the preprocessed data. Experimental results show that without feature selection, there is no significant difference in performance. After feature selection, random forest outperformed the other classifiers.

    A hybrid supervised machine learning classifier system for breast cancer prognosis using feature selection and data imbalance handling approaches

    Nowadays, breast cancer is the most frequent cancer among women. Early detection is a critical issue that can be effectively addressed by machine learning (ML) techniques. This article therefore investigates methods to improve the accuracy of ML classification models for the prognosis of breast cancer. A wrapper-based feature selection approach, along with nature-inspired algorithms such as Particle Swarm Optimization, Genetic Search, and Greedy Stepwise, has been used to identify the important features. On these selected features, the popular machine learning classifiers Support Vector Machine, J48 (C4.5 Decision Tree Algorithm), and Multilayer Perceptron (a feed-forward ANN) were used in the system. The methodology of the proposed system is structured into five stages: (1) data pre-processing; (2) data imbalance handling; (3) feature selection; (4) machine learning classifiers; (5) classifier performance evaluation. The dataset used in the experiments is the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning Repository. This article indicates that the J48 decision tree classifier is the appropriate machine learning-based classifier for optimum breast cancer prognosis. Support Vector Machine with the Particle Swarm Optimization algorithm for feature selection achieves an accuracy of 98.24%, MCC = 0.961, Sensitivity = 99.11%, Specificity = 96.54%, and Kappa statistics of 0.9606. The J48 Decision Tree classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.83%, MCC = 0.974, Sensitivity = 98.95%, Specificity = 98.58%, and Kappa statistics of 0.9735. Furthermore, the Multilayer Perceptron ANN classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.59%, MCC = 0.968, Sensitivity = 98.6%, Specificity = 98.57%, and Kappa statistics of 0.9682. Web of Science106art. no. 69

    Human Activity Recognition: A Comparison of Machine Learning Approaches

    This study investigates the performance of Machine Learning (ML) techniques used in Human Activity Recognition (HAR). The techniques considered are Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Logistic Regression, Stochastic Gradient Descent, Decision Tree, Decision Tree with entropy, Random Forest, Gradient Boosting Decision Tree, and the NGBoost algorithm. Following the activity recognition chain model for preprocessing, segmentation, feature extraction, and classification of human activities, we evaluate these ML techniques against classification performance metrics such as accuracy, precision, recall, F1 score, support, and run time on multiple HAR datasets. The findings highlight the importance of tailoring the choice of ML technique to the specific HAR requirements and the characteristics of the associated HAR dataset. Overall, this research helps in understanding the merits and shortcomings of ML techniques and guides the application of different ML techniques to various HAR datasets.

    Feature Selection via Binary Simultaneous Perturbation Stochastic Approximation

    Feature selection (FS) has become an indispensable task in dealing with today's highly complex pattern recognition problems with a massive number of features. In this study, we propose a new wrapper approach for FS based on binary simultaneous perturbation stochastic approximation (BSPSA). This pseudo-gradient descent stochastic algorithm starts with an initial feature vector and moves toward the optimal feature vector via successive iterations. In each iteration, the current feature vector's individual components are perturbed simultaneously by random offsets from a qualified probability distribution. We present computational experiments on datasets with numbers of features ranging from a few dozen to thousands using three widely used classifiers as wrappers: nearest neighbor, decision tree, and linear support vector machine. We compare our methodology against the full set of features as well as a binary genetic algorithm and sequential FS methods using cross-validated classification error rate and AUC as the performance criteria. Our results indicate that features selected by BSPSA compare favorably to alternative methods in general and that BSPSA can yield superior feature sets for datasets with tens of thousands of features by examining an extremely small fraction of the solution space. We are not aware of any other wrapper FS methods that are computationally feasible with good convergence properties for such large datasets. Comment: This is the Istanbul Sehir University Technical Report #SHR-ISE-2016.01. A short version of this report has been accepted for publication at Pattern Recognition Letters
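
The abstract above describes the BSPSA iteration only in outline: start from an initial feature vector, perturb all components at once, and step along a pseudo-gradient. The sketch below is one possible reading of that outline, not the authors' implementation. The constant gains `a` and `c`, the ±1 perturbations, rounding at 0.5, the best-so-far tracking, and the toy loss (a stand-in for cross-validated error) are all assumptions made for illustration.

```python
import random


def clip(x):
    """Keep a feature weight inside [0, 1]."""
    return min(1.0, max(0.0, x))


def to_mask(weights):
    """Round continuous weights to a binary feature mask (1 = feature kept)."""
    return [1 if w >= 0.5 else 0 for w in weights]


def bspsa_sketch(loss, n_features, iterations=200, a=0.05, c=0.05, seed=0):
    """SPSA-style search over binary feature masks (illustrative only)."""
    rng = random.Random(seed)
    weights = [0.5] * n_features  # assumed starting point
    best_mask = to_mask(weights)
    best_loss = loss(best_mask)
    for _ in range(iterations):
        # Perturb every component simultaneously with a random +/-1 direction.
        delta = [rng.choice((-1, 1)) for _ in range(n_features)]
        plus = to_mask([clip(w + c * d) for w, d in zip(weights, delta)])
        minus = to_mask([clip(w - c * d) for w, d in zip(weights, delta)])
        # Two loss evaluations give a pseudo-gradient for all components.
        diff = loss(plus) - loss(minus)
        weights = [clip(w - a * diff / (2 * c * d)) for w, d in zip(weights, delta)]
        candidate = to_mask(weights)
        candidate_loss = loss(candidate)
        if candidate_loss < best_loss:
            best_mask, best_loss = candidate, candidate_loss
    return best_mask, best_loss


# Hypothetical loss: fraction of positions that differ from a hidden "relevant" mask.
relevant = [1, 0, 1, 0, 0, 1, 0, 0]


def toy_loss(mask):
    return sum(m != r for m, r in zip(mask, relevant)) / len(relevant)


mask, final_loss = bspsa_sketch(toy_loss, n_features=len(relevant))
```

Each iteration costs two evaluations of the wrapped classifier regardless of the number of features, which fits the abstract's remark that only a small fraction of the solution space is examined.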

    Efficient Feature Selection and ML Algorithm for Accurate Diagnostics

    Machine learning algorithms have been deployed in numerous optimization, prediction and classification problems. This has endeared them for application in fields such as computer networks and medical diagnosis. Although these machine learning algorithms achieve convincing results in these fields, they face numerous challenges when deployed on imbalanced datasets. Consequently, these algorithms are often biased towards the majority class and hence unable to generalize the learning process. In addition, they are unable to deal effectively with high-dimensional datasets. Moreover, selecting features with conventional techniques based on attribute significance renders them ineffective for the majority of diagnosis applications. In this paper, feature selection is executed using the more effective Neighbourhood Components Analysis (NCA). During the classification process, an ensemble classifier comprising K-Nearest Neighbours (KNN), Naive Bayes (NB), Decision Tree (DT) and Support Vector Machine (SVM) is built, trained and tested. Finally, cross-validation is carried out to evaluate the developed ensemble model. The results show that the proposed classifier has the best performance in terms of precision, recall, F-measure and classification accuracy.

    Evaluation of Three Feature Dimension Reduction Techniques for Machine Learning-Based Crop Yield Prediction Models

    Machine learning (ML) has been widely used worldwide to develop crop yield forecasting models. However, it is still challenging to identify the most critical features from a dataset. Although either feature selection (FS) or feature extraction (FX) techniques have been employed, no research compares their performances and, more importantly, the benefits of combining both methods. Therefore, this paper proposes a framework that uses non-feature reduction (All-F) as a baseline to investigate the performance of FS, FX, and a combination of both (FSX). The case study employs the vegetation condition index (VCI)/temperature condition index (TCI) to develop 21 rice yield forecasting models for eight sub-regions in Vietnam based on ML methods, namely linear, support vector machine (SVM), decision tree (Tree), artificial neural network (ANN), and Ensemble. The results reveal that FSX takes full advantage of FS and FX: FSX-based models perform the best in 18 out of 21 models, while FS-based and FX-based models perform the best in 2 and 1 models, respectively. These FSX-, FS-, and FX-based models improve on All-F-based models by 21% on average and by up to 60% in terms of RMSE. Furthermore, the 21 best models are developed based on Ensemble (13 models), Tree (6 models), linear (1 model), and ANN (1 model). These findings highlight the significant role of FS, FX, and especially FSX, coupled with a wide range of ML algorithms (especially Ensemble), in enhancing the accuracy of crop yield prediction.

    Comparison of Classification Algorithm for Crop Decision based on Environmental Factors using Machine Learning

    Crop decision is a very complex process that plays a vital role in agriculture. Various biotic and abiotic factors affect this decision. Some crucial environmental factors are nitrogen, phosphorus, potassium, pH, temperature, humidity, and rainfall. A machine learning algorithm can predict the crop suited to these environmental conditions. The process involves several steps, such as feature selection, data cleaning, and train/test splitting. The algorithms compared are Logistic Regression, Decision Tree, Support Vector Machine, K-Nearest Neighbour, Naive Bayes, and Random Forest. This paper presents a comparison based on accuracy, along with various train/test splits, to support the choice of the best algorithm. The comparison is carried out with two tools: Google Colab, using Python and its libraries to implement the machine learning algorithms, and WEKA, which is a pre-processing tool for comparing various machine learning algorithms.