431 research outputs found

    Unsupervised feature selection method for intrusion detection system

    © 2015 IEEE. This paper considers the feature selection problem for data classification in the absence of data labels. It first proposes an unsupervised feature selection algorithm, named the Extended Laplacian score (EL), which enhances the Laplacian score method. EL completes the selection procedure in two phases. In the first phase, the Laplacian score algorithm is applied to select the features that have the best locality-preserving power. In the second phase, EL applies a Redundancy Penalization (RP) technique based on mutual information to eliminate redundancy among the selected features. This technique is an enhancement over Battiti's MIFS: unlike MIFS, it does not require a user-defined parameter such as beta to select from the candidate feature set. The final selected subset is then used to build an Intrusion Detection System. The effectiveness and feasibility of the proposed detection system are evaluated on three well-known intrusion detection datasets: KDD Cup 99, NSL-KDD and Kyoto 2006+. The evaluation results confirm that the proposed feature selection approach performs better than the Laplacian score method in terms of classification accuracy.
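    The first phase relies on the standard Laplacian score, which can be sketched as follows. This is a minimal NumPy illustration of the classic score only, not the paper's EL algorithm or its Redundancy Penalization step. The function name `laplacian_score`, the parameters `n_neighbors` and `t`, and the toy data are assumptions made for this example. Lower scores indicate features that vary less across neighbouring samples relative to their overall variance.

```python
import numpy as np


def laplacian_score(X, n_neighbors=5, t=1.0):
    """Return one Laplacian score per column of X (lower = better locality preservation)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # k nearest neighbours of each sample, excluding the sample itself.
    d2_sort = d2.copy()
    np.fill_diagonal(d2_sort, -1.0)  # force "self" to sort first, then drop it
    nbrs = np.argsort(d2_sort, axis=1)[:, 1:n_neighbors + 1]

    # Heat-kernel weights on the kNN edges, symmetrised.
    S = np.zeros((n, n))
    rows = np.repeat(np.arange(n), nbrs.shape[1])
    cols = nbrs.ravel()
    S[rows, cols] = np.exp(-d2[rows, cols] / t)
    S = np.maximum(S, S.T)

    d = S.sum(axis=1)      # degree vector (diagonal of D)
    L = np.diag(d) - S     # graph Laplacian L = D - S

    # Remove the degree-weighted mean from every feature.
    Xc = X - (d @ X) / d.sum()

    num = np.sum(Xc * (L @ Xc), axis=0)            # f~^T L f~ per feature
    den = np.sum(Xc * (d[:, None] * Xc), axis=0)   # f~^T D f~ per feature
    return np.divide(num, den, out=np.full_like(num, np.inf), where=den > 0)


# Toy data (hypothetical): two cluster-aligned features and one pure-noise feature.
rng = np.random.default_rng(0)
labels = np.repeat([0.0, 1.0], 20)
X_demo = np.column_stack([
    10.0 * labels + 0.1 * rng.standard_normal(40),
    -5.0 * labels + 0.1 * rng.standard_normal(40),
    0.01 * rng.standard_normal(40),
])
scores = laplacian_score(X_demo)
ranking = np.argsort(scores)  # best (lowest score) first
```

    In this toy setting the two cluster-aligned columns should rank ahead of the noise column, because the score is normalised by each feature's variance and so does not depend on feature scale.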

    Effective Discriminative Feature Selection with Non-trivial Solutions

    Feature selection and feature transformation, the two main ways to reduce dimensionality, are often presented separately. In this paper, a feature selection method is proposed by combining the popular transformation-based dimensionality reduction method Linear Discriminant Analysis (LDA) with sparsity regularization. We impose row sparsity on the transformation matrix of LDA through $\ell_{2,1}$-norm regularization to achieve feature selection, and the resultant formulation selects the most discriminative features and removes the redundant ones simultaneously. The formulation is extended to the $\ell_{2,p}$-norm regularized case, which is more likely to offer better sparsity when $0 < p < 1$ and is thus a better approximation to the feature selection problem. An efficient algorithm is developed to solve the $\ell_{2,p}$-norm based optimization problem, and it is proved that the algorithm converges when $0 < p \le 2$. Systematic experiments are conducted to understand the workings of the proposed method. Promising experimental results on various types of real-world data sets demonstrate the effectiveness of our algorithm.
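    To make the link between row sparsity and feature selection concrete, here is a small sketch. It assumes the usual definition $\|W\|_{2,p} = (\sum_i \|w^i\|_2^p)^{1/p}$, where $w^i$ is the $i$-th row of the transformation matrix; the abstract does not spell the definition out. The helper names `l2p_norm` and `select_features` and the matrix `W_demo` are hypothetical, and the code does not implement the paper's optimization algorithm. It only shows how a row with zero norm corresponds to a discarded feature.

```python
import numpy as np


def l2p_norm(W, p=1.0):
    """(sum_i ||w^i||_2^p)^(1/p) over the rows w^i of W; a quasi-norm when 0 < p < 1."""
    row_norms = np.linalg.norm(W, axis=1)
    return float(np.sum(row_norms ** p) ** (1.0 / p))


def select_features(W, k):
    """Indices of the k rows of W with the largest l2 norm (one row per input feature)."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.argsort(-row_norms, kind="stable")[:k]


# Hypothetical 3-feature, 2-dimensional transformation matrix; row 1 is entirely zero.
W_demo = np.array([[3.0, 4.0],
                   [0.0, 0.0],
                   [1.0, 0.0]])

norm_21 = l2p_norm(W_demo, p=1.0)      # row norms are 5, 0 and 1
kept = select_features(W_demo, k=2)    # the zero row is never among the kept features
```

    Penalising this quantity pushes whole rows of the matrix toward zero, which is why the regularizer acts as a feature selector rather than shrinking individual entries independently.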

    Conditional t-SNE: Complementary t-SNE embeddings through factoring out prior information

    Dimensionality reduction and manifold learning methods such as t-Distributed Stochastic Neighbor Embedding (t-SNE) are routinely used to map high-dimensional data into a 2-dimensional space to visualize and explore the data. However, two dimensions are typically insufficient to capture all structure in the data, the salient structure is often already known, and it is not obvious how to extract the remaining information in a similarly effective manner. To fill this gap, we introduce conditional t-SNE (ct-SNE), a generalization of t-SNE that discounts prior information from the embedding in the form of labels. To achieve this, we propose a conditioned version of the t-SNE objective, obtaining a single, integrated, and elegant method. ct-SNE has one extra parameter over t-SNE; we investigate its effects and show how to efficiently optimize the objective. Factoring out prior knowledge allows complementary structure to be captured in the embedding, providing new insights. Qualitative and quantitative empirical results on synthetic and (large) real data show ct-SNE is effective and achieves its goal.

    DiGAN breakthrough: advancing diabetic data analysis with innovative GAN-based imbalance correction techniques

    In the rapidly evolving field of medical diagnostics, the challenge of imbalanced datasets, particularly in diabetes classification, calls for innovative solutions. The study introduces DiGAN, an approach that uses Generative Adversarial Networks (GANs) for diabetes data analysis. Departing from traditional methods, DiGAN applies GANs, typically seen in image processing, to diabetes data. This novel application is complemented by integrating the unsupervised Laplacian Score for feature selection. The approach not only surpasses the limitations of existing techniques but also sets a new benchmark in classification accuracy with a 90% weighted F1-score, an improvement of over 20% compared to conventional methods. Additionally, DiGAN demonstrates superior performance over popular SMOTE-based methods in handling extremely imbalanced datasets. This research, focusing on the integrated use of Laplacian Score, GAN, and Random Forest, offers an effective and innovative solution to the long-standing data imbalance issue in medical diagnostics.
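    For readers unfamiliar with the metric quoted above, the sketch below shows how a weighted F1-score is commonly computed: per-class F1 values are averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the toy labels are assumptions for illustration; this is not taken from the DiGAN study and does not reproduce its results.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        score += (n_cls / total) * f1
    return score


# Hypothetical imbalanced toy labels: four samples of class 0, one of class 1.
y_true_demo = [0, 0, 0, 0, 1]
y_pred_demo = [0, 0, 0, 1, 1]
f1_demo = weighted_f1(y_true_demo, y_pred_demo)
```

    Because the weights follow class frequency, the majority class dominates the weighted figure, so on highly imbalanced data it is usually worth reading alongside per-class or macro-averaged scores.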