Statistical Sources of Variable Selection Bias in Classification Tree Algorithms Based on the Gini Index
Evidence for variable selection bias in classification tree algorithms based on the Gini Index is reviewed from the literature and embedded into a broader explanatory scheme: variable selection bias in such algorithms can be caused not only by the statistical effect of multiple comparisons, but also by the increasing estimation bias and variance of the splitting criterion when plug-in estimates of entropy measures like the Gini Index are employed. The relevance of these sources of variable selection bias in different simulation study designs is examined. Variable selection bias due to the explored sources applies to all classification tree algorithms based on empirical entropy measures like the Gini Index, Deviance and Information Gain, and to both binary and multiway splitting algorithms.
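The multiple-comparisons effect described above is easy to demonstrate in a small simulation. The following Python sketch (illustrative only; sample sizes, function names, and the null setup are arbitrary choices, not taken from the studies reviewed) pits a binary predictor against a continuous one when neither is related to the class labels, and counts how often the predictor with more candidate cutpoints achieves the larger maximal empirical Gini gain:

```python
# Illustrative sketch: under a null with no real signal, a predictor offering
# more candidate cutpoints tends to achieve a larger maximal empirical Gini
# gain, purely through multiple comparisons.
import random

def gini(labels):
    """Plug-in (empirical) Gini index of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def max_gini_gain(x, y):
    """Best empirical Gini gain over all binary splits of predictor x."""
    n = len(y)
    pairs = sorted(zip(x, y))
    parent, best = gini(y), 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cutpoint between tied predictor values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        best = max(best, parent - len(left) / n * gini(left) - len(right) / n * gini(right))
    return best

random.seed(0)
trials, wins_many = 500, 0
for _ in range(trials):
    y = [random.randint(0, 1) for _ in range(40)]          # labels independent of both predictors
    x_binary = [random.randint(0, 1) for _ in range(40)]   # at most 1 candidate cutpoint
    x_many = [random.random() for _ in range(40)]          # up to 39 candidate cutpoints
    if max_gini_gain(x_many, y) > max_gini_gain(x_binary, y):
        wins_many += 1
print(f"continuous predictor wins {wins_many}/{trials} null trials")
```

Even though neither predictor carries information about the labels, the continuous predictor wins far more than half of the null trials, which is exactly the selection bias toward variables with many possible splits.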
Complexes of not i-connected graphs
Complexes of (not) connected graphs, hypergraphs and their homology appear in the construction of knot invariants given by V. Vassiliev. In this paper we study the complexes of not i-connected k-hypergraphs on n vertices. We show that the complex of not 2-connected graphs has the homotopy type of a wedge of (n-2)! spheres of dimension 2n-5. This answers one of the questions raised by Vassiliev in connection with knot invariants. For this case the symmetric group action on the homology of the complex is also determined. For complexes of not 2-connected k-hypergraphs we provide a formula for the generating function of the Euler characteristic, and we introduce certain lattices of graphs that encode their topology. We also present partial results for some other cases. In particular, we show that the complex of not 3-connected graphs is Alexander dual to the complex of partial matchings of the complete graph. For not 4-connected graphs we provide a formula for the generating function of the Euler characteristic.
On partitioning multivariate self-affine time series
Given a multivariate time series, possibly of high dimension, with unknown and time-varying joint distribution, it is of interest to be able to completely partition the time series into disjoint, contiguous subseries, each of which has different distributional or pattern attributes from the preceding and succeeding subseries. An additional feature of many time series is that they display self-affinity, so that subseries at one time scale are similar to subseries at another after application of an affine transformation. Such qualities are observed in time series from many disciplines, including biology, medicine, economics, finance, and computer science. This paper defines the relevant multiobjective combinatorial optimization problem with limited assumptions as a biobjective one, and a specialized evolutionary algorithm is presented which finds optimal self-affine time series partitionings with a minimum of choice parameters. The algorithm not only finds partitionings for all possible numbers of partitions given data constraints, but also for self-affinities between these partitionings and some fine-grained partitioning. The resulting set of Pareto-efficient solution sets provides a rich representation of the self-affine properties of a multivariate time series at different locations and time scales.
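The two ingredients of the problem, distributional change between adjacent subseries and affine similarity across time scales, can be illustrated with a toy sketch (all names and scoring rules below are illustrative assumptions, not the paper's evolutionary algorithm): score a candidate breakpoint by how different the two resulting subseries are, and measure self-affinity as the residual of the best affine map a*u + b carrying one (resampled) subseries onto another.

```python
# Toy sketch: single-breakpoint scoring and an affine-similarity residual.
import math

def resample(seg, m):
    """Index-resample a segment to length m (m >= 2) for comparison."""
    n = len(seg)
    return [seg[round(i * (n - 1) / (m - 1))] for i in range(m)]

def affine_residual(u, v):
    """Mean squared error of the least-squares affine fit a*u + b to v."""
    m = len(u)
    mu, mv = sum(u) / m, sum(v) / m
    var = sum((ui - mu) ** 2 for ui in u)
    if var == 0.0:  # constant segment: best affine fit is the constant mean
        return sum((vi - mv) ** 2 for vi in v) / m
    a = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / var
    b = mv - a * mu
    return sum((a * ui + b - vi) ** 2 for ui, vi in zip(u, v)) / m

def best_breakpoint(x, min_len=2):
    """Breakpoint maximizing the mean difference between the two sides."""
    best_t, best_score = None, -1.0
    for t in range(min_len, len(x) - min_len + 1):
        left, right = x[:t], x[t:]
        score = abs(sum(left) / len(left) - sum(right) / len(right))
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# A series with a distributional change at index 50 ...
series = [0.0] * 50 + [1.0] * 50
bp = best_breakpoint(series)

# ... and a subseries that is an exact affine image of a resampled template,
# so its affine residual is numerically zero.
template = [math.sin(i / 7.0) for i in range(60)]
segment = [3.0 * v - 2.0 for v in resample(template, 25)]
res = affine_residual(resample(template, 25), segment)
print(bp, res)
```

The actual problem is much harder than this sketch: both objectives must be optimized jointly over all numbers of breakpoints, which is why the paper resorts to a specialized evolutionary algorithm and reports Pareto-efficient solution sets rather than a single partitioning.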
Unbiased split selection for classification trees based on the Gini Index
The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification with continuous predictors by means of a combinatorial approach. This distribution provides formal support for variable selection bias in favor of variables with a high amount of missing values when the Gini gain is used as the split selection criterion, and we suggest using the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation and real-data studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extensible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion.
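The idea of replacing the raw gain by a p-value can be sketched as follows (assumption: a label-permutation approximation stands in for the paper's exact combinatorial distribution of the maximally selected Gini gain; all names are illustrative):

```python
# Sketch of p-value-based split selection via a permutation null.
import random

def gini(y):
    """Empirical Gini index of a list of 0/1 class labels."""
    if not y:
        return 0.0
    p = sum(y) / len(y)
    return 2 * p * (1 - p)

def max_gini_gain(x, y):
    """Largest empirical Gini gain over all cutpoints of predictor x."""
    n = len(y)
    pairs = sorted(zip(x, y))
    parent, best = gini(y), 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cutpoint between tied predictor values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        best = max(best, parent - len(left) / n * gini(left) - len(right) / n * gini(right))
    return best

def gain_p_value(x, y, n_perm=200, seed=1):
    """Permutation p-value of the maximally selected Gini gain."""
    rng = random.Random(seed)
    observed = max_gini_gain(x, y)
    y_perm, hits = list(y), 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if max_gini_gain(x, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# An informative predictor earns a small p-value; pure noise does not.
random.seed(2)
y = [0] * 15 + [1] * 15
x_signal = [yi + 0.1 * random.random() for yi in y]  # separates the classes
x_noise = [random.random() for _ in y]               # unrelated to y
p_sig, p_noise = gain_p_value(x_signal, y), gain_p_value(x_noise, y)
print(p_sig, p_noise)
```

Selecting the split by the smallest p-value rather than the largest raw gain puts predictors with different numbers of candidate cutpoints (for example, due to different amounts of missing values) on a common scale.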
Tree stability diagnostics and some remedies against instability
Stability aspects of recursive partitioning procedures are investigated. Using resampling techniques, diagnostic tools to assess single-split stability and overall tree stability are introduced. To correct for the procedure's preference for covariates with many unique realizations, corrected p-values are used in the factor selection component of the algorithm. Finally, methods to stabilize tree-based predictors are discussed.
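A minimal version of such a resampling diagnostic can be sketched as follows (the bootstrap scheme and all names are assumptions for illustration, not the authors' exact procedure): refit the root split on bootstrap copies of the data and tabulate how often each covariate is selected; a diffuse selection frequency signals an unstable split.

```python
# Sketch of a bootstrap diagnostic for single-split (root) stability.
import random

def gini(y):
    """Empirical Gini index of 0/1 labels."""
    if not y:
        return 0.0
    p = sum(y) / len(y)
    return 2 * p * (1 - p)

def max_gain(x, y):
    """Best empirical Gini gain over all cutpoints of predictor x."""
    n = len(y)
    pairs = sorted(zip(x, y))
    parent, best = gini(y), 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cutpoint between tied predictor values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        best = max(best, parent - len(left) / n * gini(left) - len(right) / n * gini(right))
    return best

def root_split_variable(X, y):
    """Covariate chosen for the root split (largest empirical Gini gain)."""
    return max(X, key=lambda name: max_gain(X[name], y))

random.seed(3)
n = 60
y = [random.randint(0, 1) for _ in range(n)]
X = {
    "signal": [yi + 0.2 * random.random() for yi in y],  # informative
    "noise": [random.random() for _ in range(n)],        # unrelated to y
}

# Selection frequency of each covariate at the root over bootstrap samples.
counts = {name: 0 for name in X}
for _ in range(200):
    idx = [random.randrange(n) for _ in range(n)]
    Xb = {name: [vals[i] for i in idx] for name, vals in X.items()}
    yb = [y[i] for i in idx]
    counts[root_split_variable(Xb, yb)] += 1
print(counts)
```

Here the informative covariate is selected in essentially every bootstrap replicate, indicating a stable root split; for an unstable split the counts would spread across several covariates.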