COMET: A Recipe for Learning and Using Large Ensembles on Massive Data
COMET is a single-pass MapReduce algorithm for learning on large-scale data.
It builds multiple random forest ensembles on distributed blocks of data and
merges them into a mega-ensemble. This approach is appropriate when learning
from massive-scale data that is too large to fit on a single machine. To get
the best accuracy, IVoting should be used instead of bagging to generate the
training subset for each decision tree in the random forest. Experiments with
two large datasets (5GB and 50GB compressed) show that COMET compares favorably
(in both accuracy and training time) to learning on a subsample of data using a
serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble
evaluation which dynamically decides how many ensemble members to evaluate per
data point; this can reduce evaluation cost by 100X or more.
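The lazy-evaluation idea lends itself to a short sketch: score a point with one
tree at a time and stop as soon as a normal approximation of the running mean
vote says that further members are unlikely to flip the majority decision. The
sketch below is only an illustration under assumed details (binary labels, a
z-sigma stopping rule, a scikit-learn forest standing in for a merged COMET
mega-ensemble); it is not the paper's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def lazy_vote(trees, x, z=3.0, min_trees=20):
    """Evaluate ensemble members one at a time; stop once a normal
    approximation of the running mean vote says further members are
    unlikely to flip the majority decision (sketch, not COMET's code)."""
    votes = []
    for i, tree in enumerate(trees, start=1):
        votes.append(1.0 if tree.predict(x.reshape(1, -1))[0] == 1 else -1.0)
        if i >= min_trees:
            mean = np.mean(votes)
            sem = np.std(votes, ddof=1) / np.sqrt(i)  # std. error of the mean vote
            if abs(mean) > z * sem:                   # decision already settled
                break
    return (1 if np.mean(votes) > 0 else 0), len(votes)

# toy usage with a scikit-learn forest standing in for a merged mega-ensemble
X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
label, used = lazy_vote(rf.estimators_, X[0])
print(label, "after evaluating only", used, "of", len(rf.estimators_), "trees")
```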
Symmetry in Critical Random Boolean Network Dynamics
Using Boolean networks as prototypical examples, the role of symmetry in the
dynamics of heterogeneous complex systems is explored. We show that symmetry of
the dynamics, especially in critical states, is a controlling feature that can
be used both to greatly simplify analysis and to characterize different types
of dynamics. Symmetry in Boolean networks is found by determining the frequency
at which the various Boolean output functions occur. There are classes of
Boolean functions whose members behave similarly; these classes are orbits of
the controlling symmetry group. We find that the symmetry
that controls the critical random Boolean networks is expressed through the
frequency with which output functions are utilized by nodes that remain active on
dynamical attractors. This symmetry preserves canalization, a form of network
robustness. We compare it to a different symmetry known to control the dynamics
of an evolutionary process that allows Boolean networks to organize into a
critical state. Our results demonstrate the usefulness and power of using the
symmetry of the behavior of the nodes to characterize complex network dynamics,
and introduce a novel approach to the analysis of heterogeneous complex
systems.
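As a rough, assumption-laden illustration of the counting step described above,
the sketch below simulates a small K = 2 random Boolean network, runs it toward
an attractor, and tallies the truth tables used by nodes that remain active
there; the network size, run lengths, and uniform function distribution are
arbitrary choices, and the orbit/symmetry analysis itself is not reproduced.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
N, K = 200, 2                                   # K = 2 inputs per node: the critical case

inputs = rng.integers(0, N, size=(N, K))        # which nodes each node listens to
tables = rng.integers(0, 2, size=(N, 2 ** K))   # a random truth table per node

def step(state):
    # look up each node's output from its truth table, given its inputs' states
    idx = (state[inputs] * (2 ** np.arange(K))).sum(axis=1)
    return tables[np.arange(N), idx]

state = rng.integers(0, 2, size=N)
for _ in range(1000):                           # relax toward an attractor
    state = step(state)
traj = [state.copy()]
for _ in range(200):                            # record (part of) the attractor
    state = step(state)
    traj.append(state.copy())
traj = np.asarray(traj)

active = traj.min(axis=0) != traj.max(axis=0)   # nodes not frozen on the attractor
freq = Counter(tuple(tables[i]) for i in np.nonzero(active)[0])
print("output-function frequencies among active nodes:", freq.most_common(5))
```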
Ensemble Pruning for Glaucoma Detection in an Unbalanced Data Set
Background: Random forests are successful classifier ensemble methods consisting of typically 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost, especially the memory demand, of random forests by reducing the number of trees without relevant loss of performance, or even with improved performance of the sub-ensemble. Applying them to the early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background poses specific challenges.
Objectives: We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation.
Methods: The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC) and the Brier score on the total data set, in the majority class, and in the minority class of pruned random forest ensembles obtained with strategies based on the prediction accuracy of greedily grown sub-ensembles, the uncertainty-weighted accuracy, and the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma.
Results: In glaucoma classification, all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees, compared to the classification results obtained with the full ensemble of 1000 trees. In the simulation study, we show that the prevalence of glaucoma is a critical factor: lower prevalence decreases the performance of our pruning strategies.
Conclusions: The memory demand for glaucoma classification in an unbalanced data situation based on random forests can be effectively reduced by pruning strategies without loss of performance in a population with increased risk of glaucoma.
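The first of the three strategies, pruning by the prediction accuracy of
greedily grown sub-ensembles, can be sketched as a greedy forward selection
that repeatedly adds the tree maximizing the sub-ensemble's AUC on held-out
data. The code below is a minimal sketch under assumed details (synthetic
unbalanced data in place of the glaucoma features, selection on a validation
split rather than the study's exact protocol), not the paper's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# unbalanced toy data standing in for the glaucoma features (hypothetical)
X, y = make_classification(n_samples=1200, n_features=30, weights=[0.82, 0.18],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
# per-tree probability of the positive class on the validation split
probs = np.array([t.predict_proba(X_val)[:, 1] for t in rf.estimators_])

selected, remaining, best_auc = [], list(range(len(rf.estimators_))), 0.0
for _ in range(60):                        # grow a sub-ensemble of at most 60 trees
    auc, j = max((roc_auc_score(y_val, probs[selected + [k]].mean(axis=0)), k)
                 for k in remaining)
    selected.append(j)
    remaining.remove(j)
    best_auc = max(best_auc, auc)

full_auc = roc_auc_score(y_val, probs.mean(axis=0))
print(f"greedy {len(selected)}-tree AUC {best_auc:.3f} vs full-forest AUC {full_auc:.3f}")
```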
An outlier ranking tree selection approach to extreme pruning of random forests.
Random Forest (RF) is an ensemble classification technique that was developed by Breiman over a decade ago. Compared with other ensemble techniques, it has proved accurate and superior. Many researchers, however, believe that there is still room for improving its predictive accuracy. This explains why, over the past decade, there have been many extensions of RF, each employing a variety of techniques and strategies to improve certain aspects of RF. Since it has been shown empirically that ensembles tend to yield better results when there is significant diversity among the constituent models, the objective of this paper is twofold. First, it investigates how an unsupervised learning technique, namely Local Outlier Factor (LOF), can be used to identify diverse trees in the RF. Second, trees with the highest LOF scores are then used to create a new RF, termed LOFB-DRF, that is much smaller in size than RF and yet performs at least as well as RF, mostly exhibiting higher accuracy. The latter refers to a known technique called ensemble pruning. Experimental results on 10 real datasets prove the superiority of our proposed method over the traditional RF. Unprecedented pruning levels reaching as high as 99% have been achieved while at the same time improving the predictive accuracy of the ensemble. The notably extreme pruning level makes the technique a good candidate for real-time applications.
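A minimal sketch of the LOF-based selection idea: represent each tree by its
vector of predictions on held-out data, score those vectors with Local Outlier
Factor, and keep only the most outlying (most diverse) trees. Details such as
the tree representation, the score direction, and evaluating on the same
held-out split are assumptions for illustration, not the LOFB-DRF pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

X, y = make_classification(n_samples=1000, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_tr, y_tr)
# represent each tree by its vector of predictions on a held-out set
pred_vectors = np.array([t.predict(X_val) for t in rf.estimators_])

lof = LocalOutlierFactor(n_neighbors=20).fit(pred_vectors)
rank = np.argsort(lof.negative_outlier_factor_)   # most outlying ("diverse") trees first
keep = rank[:10]                                  # ~98% pruning: keep 10 of 500 trees

vote = pred_vectors[keep].mean(axis=0) >= 0.5
full = pred_vectors.mean(axis=0) >= 0.5
print("pruned accuracy:", (vote == y_val).mean(), "full accuracy:", (full == y_val).mean())
```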
A Diversity-Accuracy Measure for Homogenous Ensemble Selection
Several selection methods in the literature are essentially based on an evaluation function that determines whether a model M contributes positively to boosting the performance of the whole ensemble. In this paper, we propose a method called DIversity and ACcuracy for Ensemble Selection (DIACES) using an evaluation function based on both diversity and accuracy. The method is applied to homogeneous ensembles composed of C4.5 decision trees and uses a hill-climbing strategy. This allows selecting ensembles with the best compromise between maximum diversity and minimum error rate. Comparative studies show that in most cases the proposed method generates reduced-size ensembles with better performance than usual ensemble simplification methods.
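The abstract does not spell out the DIACES evaluation function, so the sketch
below uses a simple weighted sum of majority-vote accuracy and average pairwise
disagreement as a stand-in, applied with greedy forward selection (a basic
hill-climbing variant) over a pool of bagged scikit-learn CART trees in place
of C4.5; all of these choices are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=2)

# a pool of bagged CART trees (standing in for C4.5, which scikit-learn lacks)
pool = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                         random_state=2).fit(X_tr, y_tr).estimators_
preds = np.array([t.predict(X_val) for t in pool])

def score(idx, alpha=0.5):
    """Toy criterion: majority-vote accuracy plus average pairwise disagreement,
    a stand-in for DIACES's diversity-accuracy evaluation function."""
    vote = preds[idx].mean(axis=0) >= 0.5
    acc = (vote == y_val).mean()
    div = np.mean([(preds[i] != preds[j]).mean()
                   for i in idx for j in idx if i < j]) if len(idx) > 1 else 0.0
    return alpha * acc + (1 - alpha) * div

# greedy forward selection: start from the best single tree, then repeatedly
# add the tree that maximizes the combined diversity-accuracy score
chosen = [int(np.argmax([(p == y_val).mean() for p in preds]))]
for _ in range(14):                                   # budget: 15 trees in total
    _, j = max((score(chosen + [j]), j)
               for j in range(len(pool)) if j not in chosen)
    chosen.append(j)

vote = preds[chosen].mean(axis=0) >= 0.5
print(len(chosen), "trees, pruned-ensemble accuracy:", (vote == y_val).mean())
```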
Classy Ensemble: A Novel Ensemble Algorithm for Classification
We present Classy Ensemble, a novel ensemble-generation algorithm for
classification tasks, which aggregates models through a weighted combination of
per-class accuracy. Tested over 153 machine learning datasets, we demonstrate
that Classy Ensemble outperforms two other well-known aggregation algorithms --
order-based pruning and clustering-based pruning -- as well as the recently
introduced lexigarden ensemble generator. We then present three enhancements:
1) Classy Cluster Ensemble, which combines Classy Ensemble and cluster-based
pruning; 2) Deep Learning experiments, showing the merits of Classy Ensemble
over four image datasets: Fashion MNIST, CIFAR10, CIFAR100, and ImageNet; and
3) Classy Evolutionary Ensemble, wherein an evolutionary algorithm is used to
select the set of models from which Classy Ensemble picks. The latter,
combining learning and evolution, resulted in improved performance on the
hardest dataset.
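A minimal sketch of the per-class weighting idea: estimate each model's
per-class accuracy on a held-out split and use those values to weight its class
probabilities at prediction time. The model pool, the use of per-class recall
as "per-class accuracy", and scoring on the same held-out split are assumptions
for illustration, not Classy Ensemble's actual algorithm.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=3)
n_classes = len(np.unique(y))

models = [m.fit(X_tr, y_tr) for m in (RandomForestClassifier(random_state=3),
                                      ExtraTreesClassifier(random_state=3),
                                      KNeighborsClassifier())]

# per-class accuracy of each model on a held-out split becomes its per-class weight
weights = np.zeros((len(models), n_classes))
for i, m in enumerate(models):
    pred = m.predict(X_val)
    for c in range(n_classes):
        weights[i, c] = (pred[y_val == c] == c).mean()

# weighted combination of per-class probabilities across models
probas = np.array([m.predict_proba(X_val) for m in models])   # (models, samples, classes)
scores = (weights[:, None, :] * probas).sum(axis=0)
pred = scores.argmax(axis=1)
print("per-class-weighted ensemble accuracy:", (pred == y_val).mean())
```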
Increasing Fairness in Compromise on Accuracy via Weighted Vote with Learning Guarantees
As bias is taken increasingly seriously in widely deployed machine learning
systems, the accuracy loss that typically accompanies gains in fairness remains
a major concern for researchers. To address this problem, we
present a novel analysis of the expected fairness quality via weighted vote,
suitable for both binary and multi-class classification. The analysis takes the
correction of biased predictions by ensemble members into account and provides
learning bounds that are amenable to efficient minimisation. We further propose
a pruning method based on this analysis and the concepts of domination and
Pareto optimality, which is able to increase fairness under a prerequisite of
little or even no accuracy decline. The experimental results indicate that the
proposed learning bounds are faithful and that the proposed pruning method can
indeed increase ensemble fairness without much accuracy degradation.
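A minimal sketch of pruning by domination and Pareto optimality: score candidate
sub-ensembles on two objectives, error rate and a fairness gap, and keep only
the non-dominated ones. The demographic-parity gap, the synthetic sensitive
attribute, and the exhaustive enumeration of small sub-ensembles are stand-ins
for illustration; the paper's learning-bound-based analysis is not reproduced.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy data with a synthetic binary "sensitive" attribute (purely illustrative)
X, y = make_classification(n_samples=1500, random_state=4)
s = (X[:, 0] > 0).astype(int)
X_tr, X_val, y_tr, y_val, s_tr, s_val = train_test_split(X, y, s, random_state=4)

pool = BaggingClassifier(DecisionTreeClassifier(max_depth=4), n_estimators=12,
                         random_state=4).fit(X_tr, y_tr).estimators_
preds = np.array([t.predict(X_val) for t in pool])

def objectives(idx):
    """(error rate, demographic-parity gap) of the sub-ensemble's majority vote."""
    vote = preds[list(idx)].mean(axis=0) >= 0.5
    error = (vote != y_val).mean()
    gap = abs(vote[s_val == 0].mean() - vote[s_val == 1].mean())
    return error, gap

candidates = [c for k in (3, 5) for c in combinations(range(len(pool)), k)]
scores = {c: objectives(c) for c in candidates}

def dominates(b, a):   # b dominates a: no worse on both objectives, strictly better somewhere
    return all(x <= y for x, y in zip(scores[b], scores[a])) and scores[b] != scores[a]

pareto = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
print(len(pareto), "non-dominated sub-ensembles out of", len(candidates))
```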