    Statistical Sources of Variable Selection Bias in Classification Tree Algorithms Based on the Gini Index

    Evidence for variable selection bias in classification tree algorithms based on the Gini Index is reviewed from the literature and embedded into a broader explanatory scheme: such bias can be caused not only by the statistical effect of multiple comparisons, but also by increasing estimation bias and variance of the splitting criterion when plug-in estimates of entropy measures like the Gini Index are employed. The relevance of these sources of variable selection bias in the different simulation study designs is examined. Variable selection bias due to these sources applies to all classification tree algorithms based on empirical entropy measures like the Gini Index, Deviance and Information Gain, and to both binary and multiway splitting algorithms.
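    The multiple-comparisons effect described above can be illustrated with a small sketch (not the study's own code; all names and parameters are illustrative). Under the null of no association, a continuous predictor offers many more candidate cutpoints than a binary one, so its maximally selected plug-in Gini gain tends to be larger:

```python
import random
from collections import Counter

def gini(labels):
    # Plug-in (empirical) Gini index: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_gain(x, y):
    # Maximally selected Gini gain over all binary cutpoints on x.
    parent, n = gini(y), len(y)
    best = 0.0
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        best = max(best, parent - len(left) / n * gini(left) - len(right) / n * gini(right))
    return best

random.seed(0)
n = 50
y = [random.randint(0, 1) for _ in range(n)]          # labels independent of both predictors
x_binary = [random.randint(0, 1) for _ in range(n)]   # at most one candidate cutpoint
x_cont = [random.random() for _ in range(n)]          # n - 1 candidate cutpoints

# The continuous predictor is searched over many more cutpoints, so its
# maximally selected gain tends to come out larger even under the null.
print(best_gini_gain(x_binary, y), best_gini_gain(x_cont, y))
```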

    Complexes of not i-connected graphs

    Complexes of (not) connected graphs, hypergraphs and their homology appear in the construction of knot invariants given by V. Vassiliev. In this paper we study the complexes of not i-connected k-hypergraphs on n vertices. We show that the complex of not 2-connected graphs has the homotopy type of a wedge of (n-2)! spheres of dimension 2n-5. This answers one of the questions raised by Vassiliev in connection with knot invariants. For this case the S_n-action on the homology of the complex is also determined. For complexes of not 2-connected k-hypergraphs we provide a formula for the generating function of the Euler characteristic, and we introduce certain lattices of graphs that encode their topology. We also present partial results for some other cases. In particular, we show that the complex of not (n-2)-connected graphs is Alexander dual to the complex of partial matchings of the complete graph. For not (n-3)-connected graphs we provide a formula for the generating function of the Euler characteristic.

    On partitioning multivariate self-affine time series

    Given a multivariate time series, possibly of high dimension, with unknown and time-varying joint distribution, it is of interest to be able to completely partition the time series into disjoint, contiguous subseries, each of which has different distributional or pattern attributes from the preceding and succeeding subseries. An additional feature of many time series is that they display self-affinity, so that subseries at one time scale are similar to subseries at another after application of an affine transformation. Such qualities are observed in time series from many disciplines, including biology, medicine, economics, finance, and computer science. This paper defines the relevant multiobjective combinatorial optimization problem, under limited assumptions, as a biobjective one, and presents a specialized evolutionary algorithm which finds optimal self-affine time series partitionings with a minimum of choice parameters. The algorithm not only finds partitionings for all possible numbers of partitions given data constraints, but also for self-affinities between these partitionings and some fine-grained partitioning. The resulting set of Pareto-efficient solution sets provides a rich representation of the self-affine properties of a multivariate time series at different locations and time scales.
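    The paper's evolutionary algorithm is not reproduced here. As a minimal, hypothetical stand-in for the single-objective core of the problem, a univariate series can be partitioned into a fixed number k of contiguous segments with minimal within-segment squared deviation, solved exactly for small inputs by dynamic programming:

```python
def segment_cost(x, i, j):
    # Sum of squared deviations of x[i:j] from its segment mean.
    seg = x[i:j]
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg)

def best_partition(x, k):
    # Exact DP: minimal total within-segment cost over partitions of x into
    # exactly k contiguous, non-empty segments; returns (cost, boundaries).
    n = len(x)
    INF = float("inf")
    cost = [[INF] * (k + 1) for _ in range(n + 1)]
    back = [[0] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, n + 1):
        for s in range(1, min(j, k) + 1):
            for i in range(s - 1, j):
                c = cost[i][s - 1] + segment_cost(x, i, j)
                if c < cost[j][s]:
                    cost[j][s], back[j][s] = c, i
    bounds, j = [], n
    for s in range(k, 0, -1):      # walk back through the stored split points
        bounds.append(j)
        j = back[j][s]
    return cost[n][k], sorted(bounds)

# Two regimes with different levels; k = 2 recovers the change point.
series = [0.0, 0.1, -0.1, 0.0, 5.0, 5.1, 4.9, 5.0]
total, boundaries = best_partition(series, 2)
```

    An evolutionary approach becomes attractive precisely where this exact DP does not scale: high-dimensional series, unknown k, and a second (self-affinity) objective.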

    Unbiased split selection for classification trees based on the Gini Index

    The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides formal support for variable selection bias in favor of variables with a high amount of missing values when the Gini gain is used as split selection criterion, and we suggest using the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation and real-data studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendable to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion.
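    The exact combinatorial distribution derived in the paper is not reproduced here. As an illustrative approximation of the same idea, a Monte Carlo permutation test yields a p-value for the maximally selected Gini gain, which is comparable across predictors regardless of how many cutpoints each offers (all function names are hypothetical):

```python
import random
from collections import Counter

def _gini(labels):
    # Plug-in Gini index: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def max_gini_gain(x, y):
    # Maximally selected Gini gain over all cutpoints of a continuous predictor.
    parent, n = _gini(y), len(y)
    gains = []
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gains.append(parent - len(left) / n * _gini(left) - len(right) / n * _gini(right))
    return max(gains, default=0.0)

def permutation_p_value(x, y, n_perm=200, seed=1):
    # Monte Carlo stand-in for the exact null distribution: permute the labels
    # to simulate "no association", and compare maximally selected gains.
    rng = random.Random(seed)
    observed = max_gini_gain(x, y)
    yp = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(yp)
        if max_gini_gain(x, yp) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

    Selecting the split variable by smallest p-value, rather than largest raw gain, removes the advantage of predictors that are simply searched over more cutpoints.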

    Tree stability diagnostics and some remedies against instability

    Stability aspects of recursive partitioning procedures are investigated. Using resampling techniques, diagnostic tools to assess single split stability and overall tree stability are introduced. To correct for the procedure's preference for covariates with many unique realizations, corrected p-values are used in the factor selection component of the algorithm. Finally, methods to stabilize tree-based predictors are discussed.
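    A rough sketch of the resampling idea (a toy split selector stands in for a real tree algorithm; this is not the paper's implementation): refit the root split on bootstrap resamples and tally how often each variable is chosen. A dominant selection frequency indicates a stable split:

```python
import random

def split_variable(X, y):
    # Toy root-split selector: choose the column whose class-conditional means
    # differ the most (a stand-in for a real tree's impurity criterion).
    def score(col):
        a = [v for v, yi in zip(col, y) if yi == 0]
        b = [v for v, yi in zip(col, y) if yi == 1]
        if not a or not b:
            return 0.0
        return abs(sum(a) / len(a) - sum(b) / len(b))
    cols = list(zip(*X))
    return max(range(len(cols)), key=lambda j: score(cols[j]))

def split_stability(X, y, n_boot=200, seed=0):
    # Diagnostic: refit the root split on bootstrap resamples and report the
    # relative frequency with which each variable is selected.
    rng = random.Random(seed)
    n = len(y)
    counts = {}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        j = split_variable([X[i] for i in idx], [y[i] for i in idx])
        counts[j] = counts.get(j, 0) + 1
    return {j: c / n_boot for j, c in counts.items()}

# Variable 0 separates the classes; variable 1 is pure noise.
rng = random.Random(42)
y = [0, 1] * 25
X = [[yi * 3.0 + rng.gauss(0, 0.5), rng.gauss(0, 1)] for yi in y]
freq = split_stability(X, y)
```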