One Class Splitting Criteria for Random Forests
Random Forests (RFs) are strong machine learning tools for classification and regression. However, they remain supervised algorithms, and no extension of RFs to the one-class setting has been proposed, except for techniques based on second-class sampling. This work fills this gap by proposing a natural methodology to extend standard splitting criteria to the one-class setting, structurally generalizing RFs to one-class classification. An extensive benchmark of seven state-of-the-art anomaly detection algorithms is also presented, empirically demonstrating the relevance of our approach.
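The abstract does not give the criterion itself, so the following is only a rough one-dimensional sketch of the general idea behind one-class splits: score a candidate split by contrasting the empirical mass of inliers in each child cell with the cell's share of the volume, which plays the role of a uniform "virtual" outlier class that is never actually sampled. The function `one_class_split_gain` and its scoring rule are illustrative assumptions, not the paper's criterion.

```python
# Illustrative sketch (not the paper's exact criterion): evaluate a
# candidate split of the cell [lo, hi] on 1-D inlier data by contrasting
# the fraction of inlier points in each child against the child's share
# of the volume, mimicking a uniform virtual outlier class.

def one_class_split_gain(points, lo, hi, threshold):
    """Score a split of [lo, hi] at `threshold`.

    Higher scores mean the split better separates regions dense in
    inliers from regions that are mostly empty volume.
    """
    n = len(points)
    left = [x for x in points if x <= threshold]
    p_left = len(left) / n                  # inlier mass in left child
    v_left = (threshold - lo) / (hi - lo)   # volume share of left child
    p_right, v_right = 1.0 - p_left, 1.0 - v_left
    # Reward children where inlier mass and volume share disagree:
    # a dense small cell (or a large empty one) is informative.
    return abs(p_left - v_left) + abs(p_right - v_right)

# Inliers clustered near 0.1 inside the cell [0, 1]:
data = [0.05, 0.08, 0.10, 0.12, 0.15]
good = one_class_split_gain(data, 0.0, 1.0, 0.2)  # isolates the cluster
bad = one_class_split_gain(data, 0.0, 1.0, 0.5)   # splits empty space
```

Under this toy score the split at 0.2, which tightly encloses the cluster, beats the split at 0.5, which merely carves off empty volume.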
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involves massive data, but it also often includes online data and data
heterogeneity. Recently, some statistical methods have been adapted to process
Big Data, such as linear regression models, clustering methods and bootstrapping
schemes. Based on decision trees combined with aggregation and bootstrap ideas,
random forests were introduced by Breiman in 2001. They are a powerful
nonparametric statistical method that handles, within a single and versatile
framework, regression problems as well as two-class and multi-class
classification problems. Focusing on classification problems, this paper
proposes a selective review of available proposals for scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. We then formulate various remarks on random
forests in the Big Data context. Finally, we experiment with five variants on two
massive datasets (15 and 120 million observations), one simulated and one drawn
from real-world data. One variant relies on subsampling, while three others are
related to parallel implementations of random forests and involve either
various adaptations of the bootstrap to Big Data or "divide-and-conquer"
approaches. The fifth variant relies on online learning of random forests.
These numerical experiments highlight the relative performance of the
different variants, as well as some of their limitations.
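One well-known bootstrap adaptation in this spirit is online bagging (Oza and Russell), which replaces global resampling of n records with an independent Poisson(1) weight per record and per tree, so each tree's bag can be built as the data streams by. The sketch below illustrates that idea; it is an assumed example of the family of adaptations the abstract alludes to, not necessarily one of the five variants tested in the paper.

```python
# Sketch of a Big Data bootstrap adaptation: instead of resampling n
# records with replacement (which needs global coordination), give each
# record an independent Poisson(1) weight per tree as the data streams
# by (online bagging). For large n these weights approximate classical
# bootstrap counts, whose marginal distribution tends to Poisson(1).
import math
import random

def poisson_1(rng):
    """Draw from Poisson(lambda = 1) by Knuth's multiplication method."""
    limit = math.exp(-1.0)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def online_bootstrap_weights(n_records, n_trees, seed=0):
    """One Poisson(1) weight per (record, tree) pair, record by record."""
    rng = random.Random(seed)
    return [[poisson_1(rng) for _ in range(n_trees)]
            for _ in range(n_records)]

weights = online_bootstrap_weights(n_records=10000, n_trees=5, seed=42)
# The mean weight per tree is close to 1, as for classical bootstrap counts.
mean_w = sum(row[0] for row in weights) / len(weights)
```

Because each weight depends only on its own record, the scheme parallelizes trivially across data chunks, which is what makes it attractive in the distributed and online settings the review discusses.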
Minimax Rates for High-Dimensional Random Tessellation Forests
Random forests are a popular class of algorithms used for regression and
classification. The algorithm introduced by Breiman in 2001 and many of its
variants are ensembles of randomized decision trees built from axis-aligned
partitions of the feature space. One such variant, called Mondrian forests, was
proposed to handle the online setting and is the first class of random forests
for which minimax rates were obtained in arbitrary dimension. However, the
restriction to axis-aligned splits fails to capture dependencies between
features, and random forests that use oblique splits have shown improved
empirical performance for many tasks. In this work, we show that a large class
of random forests with general split directions also achieve minimax rates in
arbitrary dimension. This class includes STIT forests, a generalization of
Mondrian forests to arbitrary split directions, as well as random forests
derived from Poisson hyperplane tessellations. These are the first results
showing that random forest variants with oblique splits can obtain minimax
optimality in arbitrary dimension. Our proof technique relies on the novel
application of the theory of stationary random tessellations in stochastic
geometry to statistical learning theory.
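To make the distinction concrete: an axis-aligned (Mondrian-style) split thresholds a single coordinate, while an oblique (STIT or Poisson-hyperplane) split thresholds the projection of a point onto an arbitrary unit direction. The following minimal sketch, with assumed helper names, shows that the axis-aligned case is the special case where the direction is a standard basis vector.

```python
# Hedged sketch of axis-aligned vs. oblique splits; the helper names
# are illustrative, not taken from any of the cited constructions.
import math
import random

def axis_aligned_split(x, feature, threshold):
    """Mondrian-style split: compare one coordinate to a threshold."""
    return x[feature] <= threshold

def oblique_split(x, direction, threshold):
    """Hyperplane split: compare the projection <x, direction> to a threshold."""
    return sum(xi * di for xi, di in zip(x, direction)) <= threshold

def random_unit_direction(dim, rng):
    """Uniform direction on the sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

rng = random.Random(0)
d = random_unit_direction(2, rng)
# An axis-aligned split is the oblique split whose direction is a
# standard basis vector:
assert (oblique_split((0.3, 0.9), (1.0, 0.0), 0.5)
        == axis_aligned_split((0.3, 0.9), 0, 0.5))
```

A cell boundary produced by `oblique_split` can be an arbitrary hyperplane, which is what lets these forests capture dependencies between features that axis-aligned partitions miss.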
Logical limit laws for minor-closed classes of graphs
Consider an addable, minor-closed class of graphs. We prove that
the zero-one law holds in monadic second-order logic (MSO) for the random graph
drawn uniformly at random from all connected graphs in the class on n
vertices, and that the convergence law in MSO holds if we draw uniformly at
random from all graphs in the class on n vertices. We also prove analogues
of these results for the class of graphs embeddable on a fixed surface,
provided we restrict attention to first-order logic (FO). Moreover, the
limiting probability that a given FO sentence is satisfied is independent of
the chosen surface. We also prove that the closure of the set of limiting
probabilities is always a finite union of at least two disjoint intervals,
and that it is the same for FO and MSO. For the classes of forests and planar
graphs we are able to determine the closure of the set of limiting
probabilities precisely. For planar graphs it consists of exactly 108
intervals. Finally, we analyse
examples of non-addable classes where the behaviour is quite different. For
instance, the zero-one law does not hold for the random caterpillar on n
vertices, even in FO. (Accepted for publication in JCT.)
Morphological identification of Bighead Carp, Silver Carp, and Grass Carp eggs using random forests machine learning classification
Visual identification of fish eggs is difficult and unreliable due to a lack of information on the morphological egg characteristics of many species. We used random forests machine learning to predict the identity of genetically identified Bighead Carp Hypophthalmichthys nobilis, Grass Carp Ctenopharyngodon idella, and Silver Carp H. molitrix eggs based on egg morphometric and environmental characteristics. Family, genus, and species taxonomic-level random forests models were explored to assess the performance and accuracy of the predictor variables. The egg characteristics of Bighead Carp, Grass Carp, and Silver Carp were similar, and they were difficult to distinguish from one another. When combined into a single invasive carp class, the random forests models were ≥ 97% accurate at identifying invasive carp eggs, with a ≤ 5% false positive rate. Egg membrane diameter was the most important predictive variable, but the addition of ten other variables resulted in a 98% success rate for identifying invasive carp eggs from 26 other upper Mississippi River basin species. Our results revealed that a combination of morphometric and environmental measurements can be used to identify invasive carp eggs. Similar machine learning approaches could be used to identify the eggs of other fishes. These results will help managers more easily and quickly assess invasive carp reproduction.
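The abstract's two headline numbers, accuracy and false-positive rate for the combined invasive-carp class, come from a standard binary confusion matrix. The sketch below shows how they are computed; the labels and predictions are illustrative, not data from the study.

```python
# Accuracy and false-positive rate from binary predictions, as reported
# for the combined invasive-carp class. All data here is illustrative.

def accuracy_and_fpr(y_true, y_pred):
    """y_true / y_pred: 1 = invasive carp egg, 0 = other species."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # False-positive rate: non-carp eggs wrongly flagged as carp.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # one false positive
acc, fpr = accuracy_and_fpr(y_true, y_pred)
```

Reporting the false-positive rate alongside accuracy matters here because a manager acting on a false carp detection incurs real monitoring costs, so the two metrics answer different operational questions.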
Ordinal Forests
The prediction of the values of ordinal response variables using covariate data is a relatively infrequent task in many application areas. Accordingly, ordinal response variables have gained comparably little attention in the literature on statistical prediction modeling. The random forest method is one of the strongest prediction methods for binary response variables and continuous response variables. Its basic, tree-based concept has led to several extensions including prediction methods for other types of response variables.
In this paper, the ordinal forest method is introduced, a random forest based prediction method for ordinal response variables. Ordinal forests allow prediction using both low-dimensional and high-dimensional covariate data and can additionally be used to rank covariates with respect to their importance for prediction.
Using several real datasets and simulated data, the performance of ordinal forests with respect to prediction and covariate importance ranking is compared to competing approaches. First, these investigations reveal that ordinal forests tend to outperform competitors in terms of prediction performance. Second, the covariate importance measure currently used by ordinal forests discriminates influential covariates from noise covariates at least as well as the measures used by competitors. In an additional investigation using simulated data, several further important properties of the ordinal forest algorithm are studied.
The rationale underlying ordinal forests, namely the use of optimized score values in place of the class values of the ordinal response variable, is in principle applicable to any regression method for continuous outcomes, beyond the random forest regression considered in the ordinal forest method.
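That rationale can be sketched in a few lines: encode the ordinal classes as numeric scores, fit any regression method on the scores, then map continuous predictions back to classes via interval boundaries. In the sketch below a 1-nearest-neighbour regressor stands in for the random forest, and the scores and boundaries are fixed illustrative values, not the optimized ones the method actually searches for.

```python
# Hedged sketch of the ordinal-forest idea: regress on score values,
# then discretize predictions back to ordinal classes. The scores,
# boundaries, and 1-NN stand-in regressor are illustrative assumptions.
import bisect

def predict_ordinal(train_x, train_scores, boundaries, x):
    """1-NN regression on scores, then cut back to an ordinal class index."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    score = train_scores[nearest]
    # Class k iff score falls in the k-th interval between boundaries.
    return bisect.bisect_left(boundaries, score)

# Ordinal classes {0: low, 1: medium, 2: high} encoded as scores:
class_scores = {0: 0.2, 1: 0.5, 2: 0.8}  # illustrative, not optimized
boundaries = [0.35, 0.65]                # score intervals per class
train_x = [1.0, 2.0, 3.0]
train_y = [0, 1, 2]
train_scores = [class_scores[y] for y in train_y]

pred = predict_ordinal(train_x, train_scores, boundaries, 2.1)
```

Swapping the stand-in regressor for any other continuous-outcome method leaves the encode-regress-discretize pipeline unchanged, which is exactly the generality the closing sentence of the abstract claims.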