
    One Class Splitting Criteria for Random Forests

    Random Forests (RFs) are strong machine learning tools for classification and regression. However, they remain supervised algorithms, and no extension of RFs to the one-class setting has been proposed, except for techniques based on second-class sampling. This work fills that gap by proposing a natural methodology for extending standard splitting criteria to the one-class setting, structurally generalizing RFs to one-class classification. An extensive benchmark against seven state-of-the-art anomaly detection algorithms is also presented, empirically demonstrating the relevance of our approach.
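    The second-class-sampling workaround that the abstract contrasts against can be sketched in a few lines: draw uniform synthetic "outliers" as an artificial second class and train an ordinary two-class random forest. This is a hedged illustration of the baseline, not the paper's structural one-class extension; all data here is simulated.

    ```python
    # Sketch of the second-class-sampling baseline for one-class problems:
    # generate uniform synthetic outliers as an artificial second class,
    # then train a standard two-class random forest. (This is the baseline
    # the paper improves on, not its proposed one-class splitting criteria.)
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # one-class data

    # Uniform synthetic outliers covering the bounding box of the inliers
    lo, hi = X_inliers.min(axis=0), X_inliers.max(axis=0)
    X_outliers = rng.uniform(lo, hi, size=(500, 2))

    X = np.vstack([X_inliers, X_outliers])
    y = np.array([1] * 500 + [0] * 500)  # 1 = inlier, 0 = synthetic outlier

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    scores = clf.predict_proba(X_inliers)[:, 1]  # inlier score per point
    print(scores.mean())  # inliers should mostly score as class 1
    ```

    The known weakness of this baseline, and the motivation for a structural one-class extension, is that the synthetic second class must cover the feature space, which becomes hopeless in high dimension.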

    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods, and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that allows one to consider, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities -- such as out-of-bag error and variable importance -- are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), a simulated one as well as real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
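    The subsampling variant mentioned above has a one-line core: instead of bootstrapping all n observations, fit the forest on a small uniform subsample. A minimal sketch on simulated data (sizes and the toy signal are illustrative, not the paper's experimental setup):

    ```python
    # Minimal sketch of the subsampling variant for Big Data random forests:
    # fit the forest on a small uniform random subsample rather than
    # bootstrapping all n observations. Sizes here are illustrative.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    n_big, n_sub = 100_000, 5_000            # stand-in for a massive dataset
    X = rng.normal(size=(n_big, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple separable signal

    idx = rng.choice(n_big, size=n_sub, replace=False)  # uniform subsample
    forest = RandomForestClassifier(n_estimators=50, random_state=1)
    forest.fit(X[idx], y[idx])

    # Accuracy on a slice of the data approximates full-data performance
    acc = forest.score(X[:1000], y[:1000])
    print(acc)
    ```

    The parallel "divide-and-conquer" variants follow the same spirit: each worker grows trees on its own chunk of the data, and the per-chunk forests are pooled into one ensemble.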

    Minimax Rates for High-Dimensional Random Tessellation Forests

    Random forests are a popular class of algorithms used for regression and classification. The algorithm introduced by Breiman in 2001 and many of its variants are ensembles of randomized decision trees built from axis-aligned partitions of the feature space. One such variant, called Mondrian forests, was proposed to handle the online setting and is the first class of random forests for which minimax rates were obtained in arbitrary dimension. However, the restriction to axis-aligned splits fails to capture dependencies between features, and random forests that use oblique splits have shown improved empirical performance for many tasks. In this work, we show that a large class of random forests with general split directions also achieves minimax rates in arbitrary dimension. This class includes STIT forests, a generalization of Mondrian forests to arbitrary split directions, as well as random forests derived from Poisson hyperplane tessellations. These are the first results showing that random forest variants with oblique splits can obtain minimax optimality in arbitrary dimension. Our proof technique relies on the novel application of the theory of stationary random tessellations in stochastic geometry to statistical learning theory.
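    To make "oblique splits" concrete: an axis-aligned tree grown in a randomly rotated feature space cuts along oblique hyperplanes in the original space. The sketch below only mimics that flavor for illustration; the STIT and Poisson hyperplane forests of the paper are defined through random tessellations, not through this rotation trick.

    ```python
    # Illustrative only: axis-aligned splits in a randomly rotated feature
    # space act as oblique splits in the original space. This mimics the
    # flavor of oblique forests; STIT / Poisson hyperplane forests are
    # defined via random tessellations, not rotations.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(2)

    def random_rotation(d, rng):
        # QR decomposition of a Gaussian matrix yields a random rotation
        q, r = np.linalg.qr(rng.normal(size=(d, d)))
        return q * np.sign(np.diag(r))  # fix column signs

    X = rng.normal(size=(400, 2))
    y = X[:, 0] + X[:, 1]            # target varies along an oblique direction

    R = random_rotation(2, rng)
    tree = DecisionTreeRegressor(max_depth=6, random_state=2).fit(X @ R, y)
    pred = tree.predict(X @ R)
    print(np.mean((pred - y) ** 2))  # in-sample MSE of the rotated tree
    ```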

    Logical limit laws for minor-closed classes of graphs

    Let G be an addable, minor-closed class of graphs. We prove that the zero-one law holds in monadic second-order logic (MSO) for the random graph drawn uniformly at random from all connected graphs in G on n vertices, and the convergence law in MSO holds if we draw uniformly at random from all graphs in G on n vertices. We also prove analogues of these results for the class of graphs embeddable on a fixed surface, provided we restrict attention to first-order logic (FO). Moreover, the limiting probability that a given FO sentence is satisfied is independent of the surface S. We also prove that the closure of the set of limiting probabilities is always the finite union of at least two disjoint intervals, and that it is the same for FO and MSO. For the classes of forests and planar graphs we are able to determine the closure of the set of limiting probabilities precisely. For planar graphs it consists of exactly 108 intervals, each of length ≈ 5·10⁻⁶. Finally, we analyse examples of non-addable classes where the behaviour is quite different. For instance, the zero-one law does not hold for the random caterpillar on n vertices, even in FO.

    Morphological identification of Bighead Carp, Silver Carp, and Grass Carp eggs using random forests machine learning classification

    Visual identification of fish eggs is difficult and unreliable due to a lack of information on the morphological egg characteristics of many species. We used random forests machine learning to predict the identity of genetically identified Bighead Carp Hypophthalmichthys nobilis, Grass Carp Ctenopharyngodon idella, and Silver Carp H. molitrix eggs based on egg morphometric and environmental characteristics. Family-, genus-, and species-level random forest models were explored to assess the performance and accuracy of the predictor variables. The egg characteristics of Bighead Carp, Grass Carp, and Silver Carp were similar, and they were difficult to distinguish from one another. When combined into a single invasive carp class, the random forest models were ≥ 97% accurate at identifying invasive carp eggs, with a ≤ 5% false positive rate. Egg membrane diameter was the most important predictive variable, but the addition of ten other variables resulted in a 98% success rate for identifying invasive carp eggs from 26 other upper Mississippi River basin species. Our results revealed that a combination of morphometric and environmental measurements can be used to identify invasive carp eggs. Similar machine learning approaches could be used to identify the eggs of other fishes. These results will help managers more easily and quickly assess invasive carp reproduction.
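    The modeling setup described above, a random forest on morphometric plus environmental features with the three species merged into one invasive carp class, can be sketched as follows. The feature names and simulated values are hypothetical stand-ins, not the study's measurements.

    ```python
    # Hypothetical sketch of the egg-classification setup: a random forest
    # on morphometric + environmental features, with carp species merged
    # into a single "invasive carp" class. All data here is simulated.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    n = 600
    # Illustrative features: membrane diameter, embryo diameter, water temp
    X_carp = rng.normal(loc=[2.0, 1.2, 24.0], scale=0.2, size=(n // 2, 3))
    X_other = rng.normal(loc=[1.4, 0.9, 20.0], scale=0.2, size=(n // 2, 3))
    X = np.vstack([X_carp, X_other])
    y = np.array(["invasive carp"] * (n // 2) + ["other"] * (n // 2))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)
    clf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X_tr, y_tr)

    acc = clf.score(X_te, y_te)
    # feature_importances_ plays the role of the study's variable rankings
    print(acc, clf.feature_importances_)
    ```

    The built-in importance ranking is how a variable such as egg membrane diameter would surface as the dominant predictor.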

    Ordinal Forests

    The prediction of the values of ordinal response variables using covariate data is a relatively infrequent task in many application areas. Accordingly, ordinal response variables have gained comparably little attention in the literature on statistical prediction modeling. The random forest method is one of the strongest prediction methods for binary and continuous response variables. Its basic, tree-based concept has led to several extensions, including prediction methods for other types of response variables. In this paper, the ordinal forest method is introduced, a random forest based prediction method for ordinal response variables. Ordinal forests allow prediction using both low-dimensional and high-dimensional covariate data and can additionally be used to rank covariates with respect to their importance for prediction. Using several real datasets and simulated data, the performance of ordinal forests with respect to prediction and covariate importance ranking is compared to competing approaches. First, these investigations reveal that ordinal forests tend to outperform competitors in terms of prediction performance. Second, the covariate importance measure currently used by ordinal forests discriminates influential covariates from noise covariates at least as well as the measures used by competitors. In an additional investigation using simulated data, several further important properties of the ordinal forest (OF) algorithm are studied. The rationale underlying ordinal forests, namely using optimized score values in place of the class values of the ordinal response variable, is in principle applicable to any regression method for continuous outcomes, not only the random forest regression employed within the ordinal forest method.
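    The core rationale, scoring the ordinal classes, fitting a regression forest on the scores, and mapping predictions back to classes, can be sketched simply. Note the real method optimizes the score values; here they are fixed for illustration, and the data is simulated.

    ```python
    # Simplified sketch of the ordinal-forest idea: replace ordinal class
    # labels with numeric scores, fit a regression forest on the scores,
    # then map predictions back to classes. (The actual method optimizes
    # the score values; fixed scores are used here for illustration.)
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 3))
    latent = X[:, 0] + 0.5 * X[:, 1]
    # Ordinal response: 3 ordered classes from thresholding a latent variable
    y_ord = np.digitize(latent, bins=[-0.5, 0.5])  # classes 0 < 1 < 2

    scores = np.array([0.0, 1.0, 2.0])   # fixed stand-in for optimized scores
    reg = RandomForestRegressor(n_estimators=100, random_state=4)
    reg.fit(X, scores[y_ord])

    # Back-transform: assign each prediction to the nearest class score
    pred_class = np.clip(np.rint(reg.predict(X)), 0, 2).astype(int)
    print(np.mean(pred_class == y_ord))  # in-sample ordinal accuracy
    ```

    Because the regression step is decoupled from the scoring step, the same scheme would work with any continuous-outcome regressor in place of the forest, which is exactly the generality the abstract's closing sentence claims.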