A System for Induction of Oblique Decision Trees
This article describes a new system for induction of oblique decision trees.
This system, OC1, combines deterministic hill-climbing with two forms of
randomization to find a good oblique split (in the form of a hyperplane) at
each node of a decision tree. Oblique decision tree methods are tuned
especially for domains in which the attributes are numeric, although they can
be adapted to symbolic or mixed symbolic/numeric attributes. We present
extensive empirical studies, using both real and artificial data, that analyze
OC1's ability to construct oblique trees that are smaller and more accurate
than their axis-parallel counterparts. We also examine the benefits of
randomization for the construction of oblique decision trees.
Comment: See http://www.jair.org/ for an online appendix and other files accompanying this article.
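The core idea of an oblique split is a hyperplane w·x + b = 0 whose coefficients are adjusted by hill-climbing to reduce impurity, rather than a test on a single attribute. The following is a minimal sketch of that idea, not the OC1 algorithm itself: it perturbs one randomly chosen coefficient at a time and keeps the move if Gini impurity improves (OC1's actual coefficient updates and randomization schemes are more elaborate).

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(X, y, w, b):
    """Size-weighted Gini impurity of the two sides of w.x + b = 0."""
    mask = X @ w + b > 0
    n = len(y)
    return (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / n

def hill_climb_oblique(X, y, n_iters=200, step=0.1, seed=0):
    """Greedy random-coordinate perturbation of the hyperplane coefficients."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    b = 0.0
    best = split_impurity(X, y, w, b)
    for _ in range(n_iters):
        i = rng.integers(X.shape[1] + 1)      # pick a coefficient (or the bias)
        delta = rng.choice([-step, step])
        if i < X.shape[1]:
            w[i] += delta
        else:
            b += delta
        cand = split_impurity(X, y, w, b)
        if cand < best:
            best = cand                       # keep the improving move
        else:                                 # revert it
            if i < X.shape[1]:
                w[i] -= delta
            else:
                b -= delta
    return w, b, best
```

Because a single hyperplane can use all numeric attributes at once, one oblique node can replace a whole staircase of axis-parallel tests, which is why oblique trees tend to be smaller on numeric domains.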
Comparison of the CPU and memory performance of StatPatternRecognition (SPR) and Toolkit for MultiVariate Analysis (TMVA)
High Energy Physics data sets are often characterized by a huge number of
events. Therefore, it is extremely important to use statistical packages able
to efficiently analyze these unprecedented amounts of data. We compare the
performance of the statistical packages StatPatternRecognition (SPR) and
Toolkit for MultiVariate Analysis (TMVA). We focus on how CPU time and memory
usage of the learning process scale versus data set size. As classifiers, we
consider Random Forests, Boosted Decision Trees and Neural Networks. For our
tests, we employ a data set widely used in the machine learning community, the
"Threenorm" data set, as well as data tailored for testing various edge cases.
For each data set, we constantly increase its size and check CPU time and
memory needed to build the classifiers implemented in SPR and TMVA. We show
that SPR is often significantly faster and consumes significantly less memory.
For example, the SPR implementation of Random Forest is by an order of
magnitude faster and consumes an order of magnitude less memory than TMVA on
Threenorm data.
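The benchmark methodology described above can be sketched in a few lines: grow the data set, train the same classifier at each size, and record wall-clock fit time and peak memory. SPR and TMVA are C++ packages, so the sketch below uses scikit-learn's random forest purely as a stand-in, and `make_threenorm` is a rough approximation of Breiman's Threenorm benchmark (class 0 a mixture of two Gaussians at +a and -a, class 1 a Gaussian with alternating-sign mean, a = 2/sqrt(d)); none of this reproduces the paper's actual measurements.

```python
import time
import tracemalloc
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_threenorm(n, d=20, seed=0):
    """Rough stand-in for the Threenorm benchmark data set."""
    rng = np.random.default_rng(seed)
    a = 2.0 / np.sqrt(d)
    half = n // 2
    signs = rng.choice([-1.0, 1.0], size=half)
    X0 = rng.normal(size=(half, d)) + signs[:, None] * a       # class 0 mixture
    mean1 = a * np.where(np.arange(d) % 2 == 0, 1.0, -1.0)     # alternating signs
    X1 = rng.normal(size=(n - half, d)) + mean1                # class 1
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(half), np.ones(n - half)])
    return X, y

def profile_fit(n):
    """Return (seconds, peak MB) for training a small forest on n events."""
    X, y = make_threenorm(n)
    clf = RandomForestClassifier(n_estimators=20, random_state=0)
    tracemalloc.start()
    t0 = time.perf_counter()
    clf.fit(X, y)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak / 2**20

for n in (1000, 4000):
    t, m = profile_fit(n)
    print(f"n={n:5d}  fit={t:.2f}s  peak={m:.1f} MB")
```

Plotting these two curves against n is exactly the scaling comparison the abstract describes, just applied to a single implementation instead of two competing packages.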
Optimization of Signal Significance by Bagging Decision Trees
An algorithm for optimization of signal significance or any other
classification figure of merit suited for analysis of high energy physics (HEP)
data is described. This algorithm trains decision trees on many bootstrap
replicas of training data with each tree required to optimize the signal
significance or any other chosen figure of merit. New data are then classified
by a simple majority vote of the built trees. The performance of this algorithm
has been studied using a search for the radiative leptonic decay B->gamma l nu
at BaBar and shown to be superior to that of all other attempted classifiers
including such powerful methods as boosted decision trees. In the B->gamma e nu
channel, the described algorithm increases the expected signal significance
from 2.4 sigma obtained by an original method designed for the B->gamma l nu
analysis to 3.0 sigma.
Comment: 8 pages, 2 figures, 1 table.
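The ensemble structure described above — trees trained on bootstrap replicas, new data classified by a simple majority vote — can be sketched as follows. Note the hedge: the paper's trees are grown to optimize the chosen figure of merit directly, whereas the scikit-learn trees used here split on Gini impurity, so this illustrates only the bagging-and-vote scaffolding plus the S/sqrt(S+B) figure of merit, not the paper's split criterion.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def significance(s, b):
    """Signal significance S / sqrt(S + B), the figure of merit in the text."""
    return s / np.sqrt(s + b) if s + b > 0 else 0.0

def bagged_vote(X, y, X_new, n_trees=25, seed=0):
    """Train trees on bootstrap replicas; classify new data by majority vote."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_new))
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap replica
        tree = DecisionTreeClassifier(max_depth=4, random_state=0)
        tree.fit(X[idx], y[idx])
        votes += tree.predict(X_new)                 # one vote per tree
    return (votes > n_trees / 2).astype(int)         # simple majority
```

For example, a selection retaining 25 signal and 75 background events has significance 25/sqrt(100) = 2.5 sigma; optimizing each tree for this quantity, rather than for raw classification error, is what distinguishes the method from ordinary bagging.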