
    An Efficient Network Intrusion Detection Based on Decision Tree Classifier & Simple K-Mean Clustering using Dimensionality Reduction – A Review

    As the internet grows rapidly, so do attacks on the network. This creates a need for an intrusion detection system (IDS), but the large and increasing size of networks produces huge computational loads that can hamper data mining; this problem can be mitigated by using dimensionality reduction as part of data preprocessing. In this paper we study two decision tree classifiers (J48 and ID3) for detecting intrusions and compare their performance. We first apply data preprocessing, which includes feature selection using an attribute selection filter. The intrusion detection dataset is the KDD Cup 99 dataset, which has 42 features; after preprocessing, 9 selected attributes remain. The selected attributes are then discretized, and the simple k-means algorithm is used to analyse the data. Based on this study, we conclude that J48 achieves higher classification accuracy, with a high true positive rate (TPR) and a low false positive rate (FPR), compared to the ID3 decision tree classifier. DOI: 10.17762/ijritcc2321-8169.15027
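
    The paper itself runs J48 and ID3 in Weka; purely as an illustration, a scikit-learn analogue of the described pipeline (select 9 of the 42 attributes, discretize them, cluster with k-means, then fit an entropy-based tree and report TPR/FPR) might look like the sketch below. The synthetic data, the mutual-information filter and the bin count are assumptions, not the paper's choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for KDD Cup 99 (42 features, binary label).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 42))
y = rng.integers(0, 2, size=1000)           # 0 = normal, 1 = attack

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) Dimensionality reduction: keep 9 attributes, as in the paper.
sel = SelectKBest(mutual_info_classif, k=9).fit(X_tr, y_tr)
X_tr9, X_te9 = sel.transform(X_tr), sel.transform(X_te)

# 2) Discretize the selected attributes.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform").fit(X_tr9)
X_trd, X_ted = disc.transform(X_tr9), disc.transform(X_te9)

# 3) Simple k-means as an exploratory analysis of the reduced data.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_trd)

# 4) Entropy-based tree (a rough stand-in for Weka's J48/C4.5), then TPR/FPR.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_trd, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, tree.predict(X_ted)).ravel()
print("TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))
```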

    Discriminative Gene Selection Employing Linear Regression Model

    Microarray datasets enable the analysis of the expression of thousands of genes across hundreds of samples. Classifiers usually do not perform well with a large number of features (genes), as is the case for microarray datasets, so a small number of informative and discriminative features is always desirable for efficient classification. Many existing feature selection approaches attempt sample classification based on the analysis of gene expression values. In this paper, a linear regression based feature selection algorithm for two-class microarray datasets is developed which divides the training dataset into two subtypes based on the class information. Using one of the classes as the base condition, a linear regression model is built. Using this regression model, the divergence of each gene across the two classes is calculated, and genes with higher divergence values are selected as important features from the second subtype of the training data. The classification performance of the proposed approach is evaluated with SVM, Random Forest and AdaBoost classifiers. Results show that the proposed approach provides better accuracy than other existing approaches, i.e. ReliefF, CFS, a decision tree based attribute selector and attribute selection using correlation analysis
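
    The abstract does not spell out the regression formulation, so the following is only one plausible reading: fit a per-gene least-squares line on the base class and score each gene by how badly that line carries over to the other class. The per-sample mean expression profile used as the regressor is an assumption for illustration, not the paper's definition.

```python
import numpy as np

def divergence_scores(X, y, base_class=0):
    """X: (samples, genes) expression matrix; y: binary class labels.
    Returns one divergence score per gene (higher = more discriminative)."""
    Xb, Xo = X[y == base_class], X[y != base_class]
    # Per-sample summaries used as the regressor (an illustrative choice).
    t_b, t_o = Xb.mean(axis=1), Xo.mean(axis=1)
    scores = np.empty(X.shape[1])
    for g in range(X.shape[1]):
        # Least-squares line for gene g fitted on the base class only.
        slope, intercept = np.polyfit(t_b, Xb[:, g], deg=1)
        # Divergence: how poorly the base-class model fits the other class.
        resid = Xo[:, g] - (slope * t_o + intercept)
        scores[g] = np.mean(resid ** 2)
    return scores

# Toy data: 60 samples, 500 genes; keep the 20 most divergent genes.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 500))
y = rng.integers(0, 2, size=60)
top_genes = np.argsort(divergence_scores(X, y))[::-1][:20]
print(top_genes)
```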

    An extended ID3 decision tree algorithm for spatial data

    Applying data mining tasks such as classification to spatial data is more complex than applying them to non-spatial data, because spatial data mining algorithms have to consider not only the objects of interest themselves but also their neighbours in order to extract useful and interesting patterns. The ID3 classification algorithm, originally designed for non-spatial datasets, was improved by other researchers in previous work to construct a spatial decision tree from a spatial dataset containing polygon features only. The objective of this paper is to propose a new spatial decision tree algorithm, based on ID3, for discrete features represented as points, lines and polygons. Just as the ID3 algorithm uses information gain for attribute selection, the proposed algorithm uses spatial information gain to choose the best splitting layer from a set of explanatory layers. A new formula for spatial information gain is proposed using spatial measures for point, line and polygon features. Empirical results demonstrate that the proposed algorithm can be used to join two spatial objects in constructing spatial decision trees on a small spatial dataset. The proposed algorithm has been applied to a real spatial dataset consisting of point and polygon features; the result is a spatial decision tree with 138 leaves and an accuracy of 74.72%
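
    A minimal sketch of the spatial information gain idea: replace the record counts in the usual entropy and gain formulas with spatial measures (count for points, length for lines, area for polygons). The layer structure and the flood example below are illustrative assumptions, not the paper's data or exact formula.

```python
import math

def spatial_entropy(measure_by_class):
    """measure_by_class: {class_label: total spatial measure}, e.g. total
    polygon area per class rather than a count of records."""
    total = sum(measure_by_class.values())
    return -sum((m / total) * math.log2(m / total)
                for m in measure_by_class.values() if m > 0)

def spatial_info_gain(parent, partitions):
    """parent: {class: measure} before the split; partitions: one
    {class: measure} dict per branch induced by a candidate layer."""
    total = sum(parent.values())
    remainder = sum(sum(p.values()) / total * spatial_entropy(p)
                    for p in partitions)
    return spatial_entropy(parent) - remainder

# Polygon layer example: measures are areas (km^2) per class.
parent = {"flood": 120.0, "no_flood": 280.0}
split = [{"flood": 100.0, "no_flood": 40.0},
         {"flood": 20.0, "no_flood": 240.0}]
print(spatial_info_gain(parent, split))   # gain of splitting on this layer
```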

    Optimal Thresholds for Classification Trees using Nonparametric Predictive Inference

    In data mining, classification is used to assign a new observation to one of a set of predefined classes based on the attributes of the observation. Classification trees are one of the most commonly used methods in the area of classification because their rules are easy to understand and interpret. Classification trees are constructed recursively by a top-down scheme using repeated splits of the training data set, which is a subset of the data. When the data set involves a continuous-valued attribute, there is a need to select an appropriate threshold value to determine the classes and split the data. In recent years, Nonparametric Predictive Inference (NPI) has been introduced for selecting optimal thresholds for two- and three-class classification problems, where the inferences are explicitly in terms of a given number of future observations and target proportions. These target proportions enable one to choose weights that reflect the relative importance of one class over another. The NPI-based threshold selection method has previously been implemented in the context of Receiver Operating Characteristic (ROC) analysis, but not for building classification trees. Due to its predictive nature, the NPI-based threshold selection method is well suited to the classification tree method, as the end goal of building classification trees is to use them for prediction as well. In this thesis, we present new classification algorithms for building classification trees using the NPI approach for selecting the optimal thresholds. We first present a new classification algorithm, which we call the NPI2-Tree algorithm, for building binary classification trees; we then extend it to build classification trees with three ordered classes, which we call the NPI3-Tree algorithm. In order to build classification trees using our algorithms, we introduce a new procedure for selecting the optimal values of target proportions by optimising classification performance on test data. We use different measures to evaluate and compare the performance of the NPI2-Tree and NPI3-Tree classification algorithms with other classification algorithms from the literature. The experimental results show that our classification algorithms perform well compared to other algorithms. Finally, we present applications of the NPI2-Tree and NPI3-Tree classification algorithms on noisy data sets. Noise refers to incorrect values in the attribute variables or the class variable of the data sets used for classification tasks. The performance of the NPI2-Tree and NPI3-Tree classification algorithms on noisy data is evaluated using different levels of noise added to the class variable. The results show that our classification algorithms perform well in the case of noisy data and tend to be quite robust for most noise levels, compared to other classification algorithms
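
    The NPI machinery itself reasons with lower and upper probabilities for a given number of future observations, which the sketch below does not attempt to reproduce; it only illustrates the role the target proportions play as class weights when scanning candidate thresholds on a continuous attribute. The scoring rule and the candidate grid are assumptions for illustration.

```python
import numpy as np

def best_threshold(x, y, w0=0.5, w1=0.5):
    """Pick a split point on a continuous attribute, weighting the two
    classes by w0 and w1 (the role target proportions play in NPI)."""
    u = np.sort(np.unique(x))
    candidates = (u[:-1] + u[1:]) / 2           # midpoints between values

    def score(t):
        left, right = y[x <= t], y[x > t]
        s0 = w0 * np.mean(left == 0) if left.size else 0.0
        s1 = w1 * np.mean(right == 1) if right.size else 0.0
        return s0 + s1                          # weighted per-class accuracy

    return max(candidates, key=score)

# Two overlapping classes; a higher w1 pulls the threshold toward
# protecting class 1, mimicking an asymmetric target proportion.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y = np.array([0] * 100 + [1] * 100)
print(best_threshold(x, y, w0=0.4, w1=0.6))
```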

    A survey of cost-sensitive decision tree induction algorithms

    The past decade has seen significant interest in the problem of inducing decision trees that take account of costs of misclassification and costs of acquiring the features used for decision making. This survey identifies over 50 algorithms, including approaches that are direct adaptations of accuracy-based methods, approaches that use genetic algorithms or anytime methods, and approaches that utilize boosting and bagging. The survey brings together these different studies and novel approaches to cost-sensitive decision tree learning, provides a useful taxonomy and a historical timeline of how the field has developed, and should serve as a useful reference point for future research in this field
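
    As one concrete instance of the ideas this literature covers, a gain-per-unit-cost attribute selector can be sketched as below (in the spirit of criteria such as IDX or CS-ID3; the exact trade-off formula varies from algorithm to algorithm). The medical attributes and their acquisition costs are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Standard information gain of splitting on a discrete attribute."""
    by_value = {}
    for r, l in zip(rows, labels):
        by_value.setdefault(r[attr], []).append(l)
    return entropy(labels) - sum(len(ls) / len(labels) * entropy(ls)
                                 for ls in by_value.values())

def cost_sensitive_pick(rows, labels, test_costs):
    # Gain per unit acquisition cost; the +1 avoids division by zero for
    # free attributes. Real algorithms in the survey use richer trade-offs.
    return max(test_costs, key=lambda a: info_gain(rows, labels, a)
               / (test_costs[a] + 1))

rows = [{"blood_test": "hi", "scan": "pos"}, {"blood_test": "lo", "scan": "pos"},
        {"blood_test": "hi", "scan": "neg"}, {"blood_test": "lo", "scan": "neg"}]
labels = ["sick", "well", "sick", "well"]
costs = {"blood_test": 5.0, "scan": 100.0}     # assumed acquisition costs
print(cost_sensitive_pick(rows, labels, costs))  # -> "blood_test"
```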

    A Decision tree-based attribute weighting filter for naive Bayes

    The naive Bayes classifier continues to be a popular learning algorithm for data mining applications due to its simplicity and linear run-time. Many enhancements to the basic algorithm have been proposed to help mitigate its primary weakness: the assumption that attributes are independent given the class. All of them improve the performance of naive Bayes at the expense (to a greater or lesser degree) of execution time and/or simplicity of the final model. In this paper we present a simple filter method for setting attribute weights for use with naive Bayes. Experimental results show that naive Bayes with attribute weights rarely degrades the quality of the model compared to standard naive Bayes and, in many cases, improves it dramatically. The main advantages of this method compared to other approaches for improving naive Bayes are its run-time complexity and the fact that it maintains the simplicity of the final model
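
    The abstract leaves the filter unspecified; a natural reading of the title is to weight each attribute by how close to the root a decision tree tests it, then scale that attribute's log-likelihood in naive Bayes by the weight. The 1/sqrt(depth) weighting and the Gaussian likelihood below are assumptions for illustration, not necessarily the paper's formula.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_depth_weights(X, y):
    """Weight each attribute by the shallowest depth at which a decision
    tree tests it; attributes the tree never uses keep weight 0."""
    t = DecisionTreeClassifier(random_state=0).fit(X, y).tree_
    depth_of = {}
    stack = [(0, 1)]                            # (node id, depth); root = 1
    while stack:
        node, d = stack.pop()
        f = t.feature[node]
        if f >= 0:                              # internal node testing f
            depth_of[f] = min(depth_of.get(f, d), d)
            stack.append((t.children_left[node], d + 1))
            stack.append((t.children_right[node], d + 1))
    w = np.zeros(X.shape[1])
    for f, d in depth_of.items():
        w[f] = 1.0 / np.sqrt(d)                 # shallower = heavier weight
    return w

def weighted_nb_log_posterior(X_tr, y_tr, x, w):
    """Gaussian naive Bayes with each attribute's log-likelihood scaled
    by its weight; returns {class: unnormalised log posterior}."""
    out = {}
    for c in np.unique(y_tr):
        Xc = X_tr[y_tr == c]
        mu, sd = Xc.mean(axis=0), Xc.std(axis=0) + 1e-9
        ll = -0.5 * (((x - mu) / sd) ** 2 + np.log(2 * np.pi * sd ** 2))
        out[c] = np.log(len(Xc) / len(X_tr)) + float(np.sum(w * ll))
    return out

# Toy data where feature 0 dominates, so it should get the largest weight.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)
w = tree_depth_weights(X, y)
post = weighted_nb_log_posterior(X, y, X[0], w)
print(w, max(post, key=post.get))
```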