    Clustering based feature selection using Partitioning Around Medoids (PAM)

    High-dimensional data contains a large number of features and therefore requires immense computational resources, in both space and time. Several studies indicate that not all features of high-dimensional data are relevant to the classification result, so dimensionality reduction is required to improve classifier performance. Several dimensionality reduction techniques have been proposed, including feature selection and feature extraction techniques. Sequential forward feature selection and backward feature selection are greedy feature selection approaches; heuristic approaches are also applied to feature selection, using the Genetic Algorithm, PSO, and the Forest Optimization Algorithm. PCA is the best-known feature extraction method; other methods include multidimensional scaling and linear discriminant analysis. In this work, a different approach is applied to perform feature selection: cluster-analysis-based feature selection using Partitioning Around Medoids (PAM) clustering. Our experimental results showed that the classification accuracy gained when using the feature vectors' medoids to represent the original dataset is high, above 80%.
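
    The core idea can be sketched under assumptions of our own (a toy two-group dataset and a plain alternating k-medoids loop rather than the full PAM build/swap procedure): cluster the feature vectors (columns of the data matrix) and keep only the medoid feature of each cluster.

```python
import numpy as np

def pam_feature_selection(X, k, n_iter=50, seed=0):
    """Cluster the columns (features) of X with a simple k-medoids loop
    and return the indices of the k medoid features."""
    rng = np.random.default_rng(seed)
    F = X.T                                        # one row per feature vector
    n = F.shape[0]
    # pairwise Euclidean distances between feature vectors
    d = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(d[:, medoids], axis=1)  # assign to nearest medoid
        new = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            # new medoid = member minimising total distance within its cluster
            costs = d[np.ix_(members, members)].sum(axis=1)
            new[c] = members[np.argmin(costs)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return np.sort(medoids)

# toy data: 6 features forming two redundant groups of 3
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 2))
X = np.hstack([base[:, [0]] + 0.01 * rng.normal(size=(100, 3)),
               base[:, [1]] + 0.01 * rng.normal(size=(100, 3))])
selected = pam_feature_selection(X, k=2)
```

    The k selected columns then stand in for the full feature set when training a classifier.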

    A Scalable Feature Selection and Opinion Miner Using Whale Optimization Algorithm

    Due to the fast-growing volume of text documents and reviews in recent years, current analysis techniques cannot keep up with users' needs. Feature selection techniques not only help in understanding the data better but also lead to higher speed and accuracy. In this article, the Whale Optimization Algorithm is applied to search for the optimum subset of features. The F-measure, a metric based on precision and recall, is a popular way of comparing classifiers. For evaluation and comparison of the experimental results, the PART, random tree, random forest, and RBF network classification algorithms were applied to different numbers of features. Experimental results show that the random forest has the best accuracy with 500 features. Keywords: feature selection, Whale Optimization Algorithm, optimal subset selection, classification algorithm
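
    As a reminder of the metric the comparison rests on, the F-measure is the (weighted) harmonic mean of precision and recall; a minimal implementation:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_measure(0.8, 0.4)          # balanced precision/recall trade-off
f2 = f_measure(0.8, 0.4, beta=2)  # beta > 1 weights recall more heavily
```

    With beta greater than 1 the score favours recall, which matters when missing a positive review is costlier than a false alarm.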

    Implementation of Particle Swarm Optimization on Sentiment Analysis of Cyberbullying using Random Forest

    Social media exerts a significant influence on the lives of most individuals in the contemporary era. It not only enables communication among people within specific environments but also facilitates user connectivity in the virtual realm. Instagram is a social media platform that plays a pivotal role in sharing information and fostering communication among its users through photos and videos, which other users can comment on. The use of Instagram grows each year, potentially yielding both positive and negative consequences. One prevalent negative consequence is cyberbullying. Sentiment analysis of cyberbullying data can provide insight into the effectiveness of the employed methodology. This experimental study compares the performance of Random Forest alone and Random Forest after applying the Particle Swarm Optimization feature selection technique, on three data split compositions: 70:30, 80:20, and 90:10. The evaluation results indicate that the highest accuracy scores were achieved with the 90:10 split. Specifically, the plain Random Forest model yielded an accuracy of 87.50%, while the Random Forest model with Particle Swarm Optimization feature selection achieved 92.19%. Therefore, Particle Swarm Optimization as a feature selection technique can enhance the accuracy of the Random Forest method.
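
    The three split compositions can be reproduced with any seeded, shuffled hold-out split; a minimal stdlib-only sketch (function and variable names are our own, not the paper's):

```python
import random

def split(data, train_frac, seed=42):
    """Shuffle and split data into train/test by ratio (0.9 means 90:10)."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)     # seeded for reproducibility
    cut = round(len(data) * train_frac)
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

samples = list(range(100))               # stand-ins for labelled comments
for frac in (0.7, 0.8, 0.9):
    train, test = split(samples, frac)
    # train/evaluate the classifier with and without feature selection here
```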

    TSE-IDS: A Two-Stage Classifier Ensemble for Intelligent Anomaly-based Intrusion Detection System

    Intrusion detection systems (IDS) play a pivotal role in computer security by discovering and repelling malicious activities in computer networks. Anomaly-based IDS, in particular, rely on classification models trained on historical data to discover such malicious activities. In this paper, an improved IDS based on hybrid feature selection and two-level classifier ensembles is proposed. A hybrid feature selection technique comprising three methods, i.e. particle swarm optimization, the ant colony algorithm, and the genetic algorithm, is utilized to reduce the feature size of the training datasets (NSL-KDD and UNSW-NB15 are considered in this paper). Features are selected based on the classification performance of a reduced error pruning tree (REPT) classifier. Then, a two-level classifier ensemble based on two meta-learners, i.e., rotation forest and bagging, is proposed. On the NSL-KDD dataset, the proposed classifier shows 85.8% accuracy, 86.8% sensitivity, and 88.0% detection rate, remarkably outperforming other classification techniques recently proposed in the literature. Results on the UNSW-NB15 dataset also improve on those achieved by several state-of-the-art techniques. Finally, to verify the results, a two-step statistical significance test is conducted. This step has rarely been taken in IDS research thus far and therefore adds value to the experimental results achieved by the proposed classifier.
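
    The ensemble combines base-learner outputs at a second level; as a generic illustration (a simple majority vote, not the rotation forest/bagging meta-learners the paper actually uses), combining per-classifier predictions looks like:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one ensemble decision per sample."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

# three hypothetical base classifiers labelling four network flows
p1 = ["attack", "normal", "attack", "normal"]
p2 = ["attack", "attack", "attack", "normal"]
p3 = ["normal", "normal", "attack", "attack"]
final = majority_vote([p1, p2, p3])  # ["attack", "normal", "attack", "normal"]
```

    A stacked meta-learner replaces the hard vote with a second model trained on the base learners' outputs, which is the design choice the paper takes.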

    BUGOPTIMIZE: Bugs dataset Optimization with Majority Vote Cluster-Based Fine-Tuned Feature Selection for Scalable Handling

    Software bugs are prevalent throughout the software development lifecycle, posing challenges to developers in ensuring product quality and reliability. Accurate prediction of bug counts can significantly aid resource allocation and the prioritization of bug-fixing efforts. However, the vast number of attributes in bug datasets often calls for effective feature selection to improve prediction accuracy and scalability. Existing feature selection methods, though diverse, suffer from limitations such as suboptimal feature subsets and a lack of scalability. This paper proposes BUGOPTIMIZE, a novel algorithm tailored to address these challenges. BUGOPTIMIZE integrates majority-vote cluster-based fine-tuned feature selection to optimize bug datasets for scalable handling and accurate prediction. The algorithm first clusters the dataset using the K-means, EM, and hierarchical clustering algorithms and performs majority voting to assign data points to final clusters. It then employs filter-based, wrapper-based, and embedded feature selection within each cluster to identify common features. Additionally, feature selection is applied to the entire dataset to extract another set of common features. These selected features are combined to form the final best feature set. Experimental results demonstrate the efficacy of BUGOPTIMIZE compared to existing feature selection methods, reducing MAE and RMSE for Linear Regression (MAE: 0.2668 to 0.2609, RMSE: 0.3251 to 0.308) and Random Forest (MAE: 0.1626 to 0.1341, RMSE: 0.2363 to 0.224). By mitigating the disadvantages of current approaches and introducing a comprehensive, scalable solution, BUGOPTIMIZE presents a significant advancement in bug dataset optimization and prediction accuracy in software development environments.
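
    The combination step, keeping features that all three selector families agree on per cluster and unioning them with the dataset-level common features, reduces to set operations; a sketch with hypothetical software-metric names (not from the paper):

```python
def common_features(*selections):
    """Return the features chosen by every selector."""
    return set.intersection(*(set(s) for s in selections))

# hypothetical filter / wrapper / embedded picks inside two clusters
cluster_common = [
    common_features({"loc", "churn", "age"}, {"loc", "churn"}, {"churn", "loc", "fanin"}),
    common_features({"cbo", "wmc"}, {"wmc", "rfc"}, {"wmc"}),
]
# the same three selectors applied to the whole dataset
dataset_common = common_features({"loc", "wmc", "churn"}, {"wmc", "loc"}, {"loc", "wmc", "age"})

# final best feature set: per-cluster agreements unioned with dataset-level ones
final = set().union(*cluster_common) | dataset_common
```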

    Gene selection and classification for cancer microarray data based on machine learning and similarity measures

    Background: Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes, which provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money.
    Results: To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. Using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection, and several others.
    Conclusions: On average, with the use of popular learning machines including the Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier, and Random Forest, Recursive Feature Addition outperformed the other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to a random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.
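
    Recursive Feature Addition, as described, grows the gene set greedily while using similarity to avoid redundancy. A minimal numpy sketch under our own assumptions (a precomputed per-feature supervised score and a Pearson-correlation redundancy test, not the paper's exact procedure):

```python
import numpy as np

def recursive_feature_addition(X, score, max_corr=0.9, k=5):
    """Greedy sketch: visit features from highest to lowest score, adding
    each one unless it is too correlated (|r| > max_corr) with a feature
    already selected."""
    order = np.argsort(score)[::-1]      # best-scoring features first
    selected = []
    for j in order:
        if all(abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) <= max_corr
               for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return selected

# toy data: feature 1 is a near-duplicate of feature 0
rng = np.random.default_rng(0)
a = rng.normal(size=50)
X = np.column_stack([a,
                     a + 1e-6 * rng.normal(size=50),
                     rng.normal(size=50),
                     rng.normal(size=50)])
sel = recursive_feature_addition(X, np.array([5.0, 4.0, 3.0, 2.0]), k=3)
```

    The redundant copy of feature 0 is skipped even though it scores second-highest, which is the behaviour the similarity measure buys over a pure score ranking.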

    Exploring the Time-efficient Evolutionary-based Feature Selection Algorithms for Speech Data under Stressful Work Condition

    The general goal of Machine Learning (ML) advancements is faster computation with lower computational resources, while the curse of dimensionality burdens both computation time and resources. This paper describes the benefits of Feature Selection Algorithms (FSA) for speech data recorded under workload stress. FSA reduces both data dimension and computation time while retaining the speech information. We chose the robust Evolutionary Algorithm, Harmony Search, Principal Component Analysis, the Genetic Algorithm, Particle Swarm Optimization, Ant Colony Optimization, and Bee Colony Optimization, which are then evaluated using hierarchical machine learning models. These FSAs are explored on conversational workload-stress data from a Customer Service hotline, where daily complaints trigger stress in speaking. We employed precisely 223 acoustic-based features. Using Random Forest, our evaluation showed that computation time improved to 3.6 times faster than with the original 223 features. Evaluation using the Support Vector Machine beat that record with a computation time of 0.001 seconds.
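
    Such speed-up figures come from timing the same pipeline with the full and the reduced feature set; a toy timing harness (the classifier stand-in and feature counts here are illustrative only, not the paper's models):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# hypothetical stand-in classifier: per-sample cost grows with feature count
def predict(batch):
    return [sum(feats) > 0 for feats in batch]

full = [[0.1] * 223 for _ in range(2000)]     # all 223 acoustic features
reduced = [[0.1] * 60 for _ in range(2000)]   # hypothetical reduced set
_, t_full = timed(predict, full)
_, t_reduced = timed(predict, reduced)        # expected: smaller elapsed time
```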

    Intelligent feature selection using particle swarm optimization algorithm with a decision tree for DDoS attack detection

    The explosive development of information technology has brought a steady rise in cyber-attacks. The distributed denial of service (DDoS) attack is a malicious threat to the modern cyber-security world that disrupts the performance of network servers. It is a pernicious type of attack that can forward a large amount of traffic to damage one or all of a target’s resources simultaneously, preventing authenticated users from accessing network services. This paper aims to select the smallest number of relevant DDoS attack detection features by designing an intelligent wrapper feature selection model that combines a binary particle swarm optimization algorithm with a decision tree classifier. The binary particle swarm optimization algorithm is used to solve discrete optimization problems such as feature selection, while the decision tree classifier serves as the performance evaluator, assessing the wrapper model’s accuracy on the features selected from the network traffic flows. The model’s intelligence is indicated by its selecting 19 convenient features out of the dataset’s 76. The experiments were conducted on a large DDoS dataset. The optimally selected features were evaluated with different machine learning algorithms using performance metrics, i.e. accuracy, recall, precision, and F1-score, to detect DDoS attacks. The proposed model showed a high accuracy rate: 99.52% with the decision tree classifier, 96.94% with random forest, and 90.06% with the multi-layer perceptron. The paper also compares the outcome of the proposed model with previous feature selection models in terms of these performance metrics. This outcome will be useful for improving DDoS attack detection systems based on machine learning algorithms, and it may also apply to other research topics such as DDoS attack detection in the cloud environment and DDoS attack mitigation systems.
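
    A binary PSO wrapper keeps a 0/1 mask per particle and squashes velocities through a sigmoid to get per-bit probabilities. The sketch below uses a toy surrogate fitness instead of a real decision-tree evaluation, and all parameter values are illustrative:

```python
import numpy as np

def binary_pso(fitness, n_feats, n_particles=20, iters=60, seed=0):
    """Minimal binary PSO: positions are 0/1 feature masks; velocities pass
    through a sigmoid to give per-bit probabilities of selecting a feature."""
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, 2, size=(n_particles, n_feats))
    vel = rng.normal(scale=0.1, size=(n_particles, n_feats))
    pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_val)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = (rng.random(pos.shape) < 1 / (1 + np.exp(-vel))).astype(int)
        val = np.array([fitness(p) for p in pos])
        improved = val > pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], val[improved]
        gbest = pbest[np.argmax(pbest_val)].copy()
    return gbest

# toy stand-in for wrapper accuracy: features 0-4 are "relevant"
relevant = np.zeros(12, dtype=int)
relevant[:5] = 1

def fit(mask):
    # surrogate fitness: reward agreement with the known-relevant mask and
    # lightly penalise subset size (a real wrapper would train the tree here)
    return np.sum(mask == relevant) - 0.01 * mask.sum()

best = binary_pso(fit, 12)
```

    In the paper's setup, `fit` would instead train and score the decision tree on the masked features, so each fitness evaluation is one full wrapper evaluation.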