A System for Induction of Oblique Decision Trees
This article describes a new system for induction of oblique decision trees.
This system, OC1, combines deterministic hill-climbing with two forms of
randomization to find a good oblique split (in the form of a hyperplane) at
each node of a decision tree. Oblique decision tree methods are tuned
especially for domains in which the attributes are numeric, although they can
be adapted to symbolic or mixed symbolic/numeric attributes. We present
extensive empirical studies, using both real and artificial data, that analyze
OC1's ability to construct oblique trees that are smaller and more accurate
than their axis-parallel counterparts. We also examine the benefits of
randomization for the construction of oblique decision trees.
Comment: See http://www.jair.org/ for an online appendix and other files accompanying this article.
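In case the notion of an oblique split is unfamiliar: where an axis-parallel tree tests a single attribute at each node, an oblique tree tests a linear combination of attributes, i.e. a hyperplane. The sketch below is a minimal illustration of the kind of search the abstract describes, combining random restarts and random coefficient perturbations with deterministic acceptance of improvements; all names, the impurity measure, and the perturbation scheme are illustrative assumptions, not OC1's actual code.

```python
# Minimal illustration of an OC1-style oblique split search. Everything here
# (names, impurity measure, perturbation scheme) is an illustrative
# assumption, not OC1's actual implementation.
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(X, y, w, b):
    """Size-weighted impurity of the two sides of the hyperplane w.x + b > 0."""
    mask = X @ w + b > 0
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return np.inf                    # degenerate split: everything on one side
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / len(y)

def oblique_split(X, y, n_restarts=10, n_steps=200, seed=None):
    """Random restarts + hill-climbing over hyperplane coefficients."""
    rng = np.random.default_rng(seed)
    best = (None, 0.0, np.inf)
    for _ in range(n_restarts):          # randomization, form 1: random restarts
        w, b = rng.normal(size=X.shape[1]), rng.normal()
        imp = split_impurity(X, y, w, b)
        for _ in range(n_steps):         # randomization, form 2: random perturbations
            w2, b2 = w.copy(), b
            if rng.random() < 0.8:
                w2[rng.integers(len(w))] += rng.normal(scale=0.5)
            else:
                b2 += rng.normal(scale=0.5)
            imp2 = split_impurity(X, y, w2, b2)
            if imp2 < imp:               # deterministic: accept only improvements
                w, b, imp = w2, b2, imp2
        if imp < best[2]:
            best = (w, b, imp)
    return best
```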
Tree-based Intelligent Intrusion Detection System in Internet of Vehicles
The use of autonomous vehicles (AVs) is a promising technology in Intelligent
Transportation Systems (ITSs) to improve safety and driving efficiency.
Vehicle-to-everything (V2X) technology enables communication among vehicles and
other infrastructures. However, AVs and Internet of Vehicles (IoV) are
vulnerable to different types of cyber-attacks such as denial of service,
spoofing, and sniffing attacks. In this paper, an intelligent intrusion
detection system (IDS) is proposed based on tree-structured machine learning
models. The results from the implementation of the proposed intrusion detection
system on standard data sets indicate that the system has the ability to
identify various cyber-attacks in the AV networks. Furthermore, the proposed
ensemble learning and feature selection approaches enable the proposed system
to achieve a high detection rate and low computational cost simultaneously.
Comment: Accepted in IEEE Global Communications Conference (GLOBECOM) 2019.
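As a rough illustration of such a pipeline (not the paper's implementation), the sketch below pairs tree-importance-based feature selection with a tree ensemble in scikit-learn; the CSV path and the "label" column are placeholders for a labelled vehicular-network traffic dataset.

```python
# A rough sketch of a tree-based IDS pipeline, not the paper's implementation.
# The CSV path and the "label" column are placeholders for a labelled
# vehicular-network traffic dataset.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("vehicle_network_traffic.csv")       # placeholder dataset
X, y = df.drop(columns=["label"]), df["label"]        # label = attack class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Feature selection via tree importances: keeping only informative features
# is what lets the detector stay cheap at inference time.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr_sel, y_tr)
print(classification_report(y_te, clf.predict(X_te_sel)))
```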
Improved customer choice predictions using ensemble methods
In this paper various ensemble learning methods from machine learning and statistics are considered and applied to the customer choice modeling problem. The application of ensemble learning usually improves the prediction quality of flexible models like decision trees and thus leads to improved predictions. We give experimental results for two real-life marketing datasets using decision trees, ensemble versions of decision trees and the logistic regression model, which is a standard approach for this problem. The ensemble models are found to improve upon individual decision trees and outperform logistic regression.
Next, an additive decomposition of the prediction error of a model, the bias/variance decomposition, is considered. A model with a high bias lacks the flexibility to fit the data well. A high variance indicates that a model is unstable with respect to different datasets. Decision trees have a high variance component and a low bias component in the prediction error, whereas logistic regression has a high bias component and a low variance component. It is shown that ensemble methods aim at minimizing the variance component in the prediction error while leaving the bias component unaltered. Bias/variance decompositions for all models for both customer choice datasets are given to illustrate these concepts.
Keywords: brand choice; data mining; boosting; choice models; bias/variance decomposition; bagging; CART; ensembles
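The bias/variance point is easy to demonstrate numerically. The sketch below, on a synthetic regression problem (where the squared-error decomposition is cleanest) rather than the paper's choice datasets, estimates bias and variance over repeated training sets for a single decision tree and a bagged ensemble; bagging should shrink the variance term while leaving the bias term roughly unchanged.

```python
# Numerical illustration of the bias/variance decomposition on synthetic
# regression data; bagging reduces the variance component of a tree while
# leaving its bias roughly unaltered.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample(n):
    """Draw a fresh training set from y = sin(x) + noise."""
    x = rng.uniform(0, 6, size=(n, 1))
    return x, np.sin(x).ravel() + rng.normal(scale=0.3, size=n)

x_test = np.linspace(0, 6, 200).reshape(-1, 1)
f_true = np.sin(x_test).ravel()

models = [("single tree", DecisionTreeRegressor()),
          ("bagged trees", BaggingRegressor(DecisionTreeRegressor(), n_estimators=50))]
for name, model in models:
    # Refit on 100 independent training sets to see how predictions vary.
    preds = np.array([model.fit(*sample(200)).predict(x_test) for _ in range(100)])
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = preds.var(axis=0).mean()
    print(f"{name}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```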
The role of classifiers in feature selection: Number vs nature
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
Wrapper feature selection approaches are widely used to select a small subset of relevant features from a dataset. However, Wrappers suffer from the fact that they use only a single classifier when selecting the features. The problem with using a single classifier is that each classifier is of a different nature and has its own biases, which means that each classifier will select a different feature subset. To address this problem, this thesis investigates the effects of using different classifiers for Wrapper feature selection; more specifically, the effects of using different numbers of classifiers and classifiers of different natures.
This aim is achieved by proposing a new data mining method called Wrapper-based Decision Trees (WDT). The WDT method has the ability to combine multiple classifiers from four different families, including Bayesian Network, Decision Tree, Nearest Neighbour and Support Vector Machine, to select relevant features and visualise the relationships among the selected features using decision trees. Specifically, the WDT method is applied to investigate three research questions of this thesis: (1) the effects of number of classifiers on feature selection results; (2) the effects of nature of classifiers on feature selection results; and (3) which of the two (i.e., number or nature of classifiers) has more of an effect on feature selection results. Two types of user preference datasets derived from Human-Computer Interaction (HCI) are used with WDT to assist in answering these three research questions.
The results from the investigation revealed that both the number of classifiers and the nature of the classifiers greatly affect feature selection results. In terms of the number of classifiers, the results showed that few classifiers selected many relevant features whereas many classifiers selected few relevant features. In addition, it was found that using three classifiers resulted in highly accurate feature subsets. In terms of the nature of the classifiers, it was shown that Decision Tree, Bayesian Network and Nearest Neighbour classifiers caused significant differences in both the number of features selected and the accuracy levels of the features. A comparison of results regarding the number of classifiers and the nature of classifiers revealed that the former has more of an effect on feature selection than the latter.
The thesis makes contributions to three communities: data mining, feature selection, and HCI. For the data mining community, this thesis proposes a new method called WDT which integrates the use of multiple classifiers for feature selection with decision trees to effectively select and visualise the most relevant features within a dataset. For the feature selection community, the results of this thesis have shown that the number and nature of classifiers can truly affect the feature selection process. The results, and the suggestions based on them, provide useful insight about classifiers when performing feature selection. For the HCI community, this thesis has shown the usefulness of feature selection for identifying a small number of highly relevant features for determining the preferences of different users.
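As a loose illustration of the idea behind WDT (not the thesis's implementation), the sketch below runs a wrapper search with one classifier from each of four families and keeps features selected by a majority of them; GaussianNB merely stands in for the Bayesian family, and the dataset and all parameters are illustrative.

```python
# Loose illustration of combining classifiers of different natures for
# wrapper feature selection; this is not the thesis's WDT implementation,
# and GaussianNB stands in for the Bayesian family.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
families = [GaussianNB(), DecisionTreeClassifier(random_state=0),
            KNeighborsClassifier(), SVC()]

votes = np.zeros(X.shape[1])
for clf in families:
    # Wrapper search: greedily pick the 5 features that best serve this classifier.
    sfs = SequentialFeatureSelector(clf, n_features_to_select=5, cv=3)
    votes += sfs.fit(X, y).get_support()

# Keep only features chosen by at least half of the classifier families.
selected = np.flatnonzero(votes >= len(families) / 2)
print("features selected by a majority of families:", selected)
```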
Improving Floating Search Feature Selection using Genetic Algorithm
Classification, a process for predicting the class of a given input data point, is one of the most fundamental tasks in data mining. Classification performance is negatively affected by noisy data, so selecting features relevant to the problem is a critical step in classification, especially when applied to large datasets. In this article, a novel filter-based floating search technique is proposed for selecting an optimal set of features for classification. A genetic algorithm is employed to improve the quality of the features selected by the floating search method in each iteration. A criterion function is applied to select relevant, high-quality features that can improve classification accuracy. The proposed method was evaluated using 20 standard machine learning datasets of varying size and complexity. The results show that the proposed method is effective across different classifiers and performs well in comparison with recently reported techniques. In addition, the proposed method combined with a support vector machine gives the best performance among the classifiers studied and outperformed previously reported techniques on the majority of the datasets.
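A condensed sketch of the general scheme follows; it is not the article's algorithm. It runs sequential floating forward selection with an mRMR-style filter criterion (relevance minus redundancy, both assumptions on my part) and approximates the genetic-algorithm refinement with a single mutate-and-keep-if-better step per iteration.

```python
# Condensed sketch of floating search plus a GA-flavored refinement; not the
# article's algorithm. The mRMR-style criterion and the reduction of the GA
# to one mutation per iteration are illustrative simplifications.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)

def criterion(subset):
    """Filter score: mean relevance to the class minus mean pairwise redundancy."""
    feats = sorted(subset)
    if not feats:
        return 0.0
    relevance = mutual_info_classif(X[:, feats], y, random_state=0).mean()
    if len(feats) == 1:
        return relevance
    corr = np.abs(np.corrcoef(X[:, feats], rowvar=False))
    redundancy = (corr.sum() - len(feats)) / (len(feats) * (len(feats) - 1))
    return relevance - redundancy

selected, target = set(), 6
for _ in range(50):                      # iteration cap: floating search can oscillate
    if len(selected) >= target:
        break
    # Forward step: add the single most helpful remaining feature.
    remaining = set(range(X.shape[1])) - selected
    best = max(remaining, key=lambda f: criterion(selected | {f}))
    selected.add(best)
    # Floating (backward) step: drop a feature other than the one just added
    # whenever that improves the criterion.
    while len(selected) > 2:
        worst = max(selected - {best}, key=lambda f: criterion(selected - {f}))
        if criterion(selected - {worst}) > criterion(selected):
            selected.remove(worst)
        else:
            break
    # GA-flavored refinement, heavily simplified: one random mutation per
    # iteration, kept only if it scores better.
    mutant = selected ^ {int(rng.integers(X.shape[1]))}
    if mutant and criterion(mutant) > criterion(selected):
        selected = mutant
print("selected features:", sorted(selected))
```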
Decision tree rule-based feature selection for imbalanced data
A class imbalance problem appears in many real-world applications, e.g., fault diagnosis, text categorization and fraud detection. When dealing with an imbalanced dataset, feature selection becomes an important issue. To address it, this work proposes a feature selection method based on a decision tree rule and a weighted Gini index. The effectiveness of the proposed method is verified by classifying a dataset from Santander Bank and two datasets from the UCI machine learning repository. The results show that our method achieves higher Area Under the Curve (AUC) and F-measure. We also compare it with filter-based feature selection approaches, i.e., Chi-square and F-statistic; our method outperforms both, at the cost of slightly more computation.
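To make the weighted-Gini idea concrete, the sketch below scores each feature by the impurity reduction of a simple median split, with classes reweighted inversely to their frequency so the minority class is not swamped. This is an assumption-laden illustration of the general idea, not the paper's rule-extraction method.

```python
# Hedged sketch of a class-weighted Gini criterion for imbalanced data;
# illustration only, not the paper's decision-tree-rule method.
import numpy as np

def weighted_gini(y, class_weight):
    """Gini impurity computed on reweighted class frequencies."""
    classes, counts = np.unique(y, return_counts=True)
    w = np.array([class_weight[c] for c in classes]) * counts
    p = w / w.sum()
    return 1.0 - np.sum(p ** 2)

def feature_score(x, y, class_weight):
    """Weighted-Gini reduction of a median split on a single feature."""
    parent = weighted_gini(y, class_weight)
    mask = x <= np.median(x)
    if mask.all() or (~mask).all():
        return 0.0                       # degenerate split
    child = (mask.sum() * weighted_gini(y[mask], class_weight)
             + (~mask).sum() * weighted_gini(y[~mask], class_weight)) / len(y)
    return parent - child

# Toy imbalanced data: feature 0 carries the signal, feature 1 is noise.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)             # 5% positive class
X = np.column_stack([y + rng.normal(scale=0.5, size=1000),
                     rng.normal(size=1000)])
cw = {0: 1.0, 1: (y == 0).sum() / (y == 1).sum()}     # inverse-frequency weights
print([round(feature_score(X[:, j], y, cw), 4) for j in range(2)])
```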
Rule-based Machine Learning Methods for Functional Prediction
We describe a machine learning method for predicting the value of a
real-valued function, given the values of multiple input variables. The method
induces solutions from samples in the form of ordered disjunctive normal form
(DNF) decision rules. A central objective of the method and representation is
the induction of compact, easily interpretable solutions. This rule-based
decision model can be extended to search efficiently for similar cases prior to
approximating function values. Experimental results on real-world data
demonstrate that the new techniques are competitive with existing machine
learning and statistical methods and can sometimes yield superior regression
performance.
Comment: See http://www.jair.org/ for any accompanying files.
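The representation itself is simple to picture. The sketch below (illustrative only, not the paper's learner) encodes an ordered rule list in which each rule is a conjunction of attribute tests paired with a real value; prediction returns the value of the first rule whose conditions all hold, falling back to a default.

```python
# Illustration of an ordered decision-rule list for function approximation;
# the rules and attribute names here are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    conditions: List[Callable[[Dict], bool]]    # conjunction of attribute tests
    value: float                                 # predicted function value

def predict(rules: List[Rule], default: float, example: Dict) -> float:
    """Evaluate rules in order; the first full match wins."""
    for rule in rules:
        if all(cond(example) for cond in rule.conditions):
            return rule.value
    return default

# Hypothetical rules for a real-valued target (e.g., energy demand).
rules = [
    Rule([lambda e: e["temp"] > 30, lambda e: e["humidity"] > 0.7], 95.0),
    Rule([lambda e: e["temp"] > 30], 80.0),
    Rule([lambda e: e["temp"] > 20], 55.0),
]
print(predict(rules, 20.0, {"temp": 33, "humidity": 0.8}))   # prints 95.0
```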