Training Big Random Forests with Little Resources
Without access to large compute clusters, building random forests on large
datasets is still a challenging problem. This is particularly the case if
fully-grown trees are desired. We propose a simple yet effective framework that
allows one to efficiently construct ensembles of huge trees for hundreds of
millions or even billions of training instances using a cheap desktop computer
with commodity hardware. The basic idea is to consider a multi-level
construction scheme, which builds top trees for small random subsets of the
available data and which subsequently distributes all training instances to the
top trees' leaves for further processing. While being conceptually simple, the
overall efficiency crucially depends on the particular implementation of the
different phases. The practical merits of our approach are demonstrated using
dense datasets with hundreds of millions of training instances.
Comment: 9 pages, 9 figures
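A minimal sketch of the multi-level scheme, assuming scikit-learn, is given below. The subset size, top-tree depth, and function names are illustrative choices, not the authors' implementation, whose per-phase engineering the abstract stresses as crucial.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_multilevel_tree(X, y, subset_size=100_000, top_depth=8, seed=0):
    """Phase 1: fit a shallow 'top tree' on a small random subset.
    Phase 2: route all training instances to the top tree's leaves.
    Phase 3: grow a fully-grown 'bottom tree' per leaf on its chunk."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
    top = DecisionTreeClassifier(max_depth=top_depth,
                                 random_state=seed).fit(X[idx], y[idx])
    leaf_ids = top.apply(X)                      # distribute all instances
    bottom = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        bottom[leaf] = DecisionTreeClassifier(   # unpruned, fully grown
            random_state=seed).fit(X[mask], y[mask])
    return top, bottom

def predict_multilevel(top, bottom, X):
    leaf_ids = top.apply(X)
    pred = np.empty(len(X), dtype=object)
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        pred[mask] = bottom[leaf].predict(X[mask])
    return pred
```

In a full ensemble, this two-level construction would be repeated per tree on different random subsets, with phase 2 streaming instances through the top tree in chunks rather than holding everything in memory.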
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involves massive datasets, but it often includes online data and data
heterogeneity as well. Recently, some statistical methods have been adapted to process
Big Data, like linear regression models, clustering methods and bootstrapping
schemes. Based on decision trees combined with aggregation and bootstrap ideas,
random forests were introduced by Breiman in 2001. They are a powerful
nonparametric statistical method that handles, within a single versatile
framework, regression problems as well as two-class and multi-class
classification problems. Focusing on classification problems, this paper
proposes a selective review of available proposals that deal with scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. Then, we formulate various remarks for random
forests in the Big Data context. Finally, we experiment with five variants on two
massive datasets (15 and 120 million observations): a simulated one as well
as real-world data. One variant relies on subsampling, while three others are
parallel implementations of random forests that involve either
various adaptations of the bootstrap to Big Data or "divide-and-conquer"
approaches. The fifth variant relies on online learning of random forests.
These numerical experiments highlight the relative performance of the
different variants, as well as some of their limitations.
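To make the divide-and-conquer idea concrete, the sketch below partitions the data into disjoint chunks, fits a sub-forest per chunk, and aggregates by majority vote. This is an illustrative reconstruction assuming scikit-learn, not the exact variant evaluated in the paper; the chunking strategy and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def divide_and_conquer_forest(X, y, n_chunks=10, trees_per_chunk=50, seed=0):
    """Fit one sub-forest per disjoint data chunk; the final ensemble
    is simply the union of the sub-forests."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(len(X)), n_chunks)
    return [RandomForestClassifier(n_estimators=trees_per_chunk,
                                   random_state=seed).fit(X[c], y[c])
            for c in chunks]

def predict_majority(forests, X):
    """Aggregate the sub-forests' predictions by majority vote."""
    votes = np.stack([f.predict(X) for f in forests])  # (n_forests, n_samples)
    maj = np.empty(votes.shape[1], dtype=votes.dtype)
    for j in range(votes.shape[1]):
        values, counts = np.unique(votes[:, j], return_counts=True)
        maj[j] = values[np.argmax(counts)]
    return maj
```

Because each sub-forest only ever sees its own chunk, the fitting loop parallelizes trivially across machines or processes.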
On PAC-Bayesian Bounds for Random Forests
Existing guarantees in terms of rigorous upper bounds on the generalization
error for the original random forest algorithm, one of the most frequently used
machine learning methods, are unsatisfying. We discuss and evaluate various
PAC-Bayesian approaches to derive such bounds. The bounds do not require
additional hold-out data, because the out-of-bag samples from the bagging in
the training process can be exploited. A random forest predicts by taking a
majority vote of an ensemble of decision trees. The first approach is to bound
the error of the vote by twice the error of the corresponding Gibbs classifier
(classifying with a single member of the ensemble selected at random). However,
this approach does not take into account how the errors of the individual
classifiers average out when the majority vote is taken. This effect provides a
significant boost in performance when the errors are independent or negatively
correlated, but when the correlations are strong the advantage from taking the
majority vote is small. The second approach based on PAC-Bayesian C-bounds
takes dependencies between ensemble members into account, but it requires
estimating correlations between the errors of the individual classifiers. When
the correlations are high or the estimation is poor, the bounds degrade. In our
experiments, we compute generalization bounds for random forests on various
benchmark data sets. Because the individual decision trees already perform
well, their predictions are highly correlated and the C-bounds do not lead to
satisfactory results. For the same reason, the bounds based on the analysis of
Gibbs classifiers are typically superior and often reasonably tight. Bounds
based on a validation set, which comes at the cost of a smaller training set,
gave better performance guarantees but worse predictive performance in most experiments.
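The first approach can be sketched numerically: estimate the Gibbs error from out-of-bag samples and double it. The sketch below is illustrative only; it uses scikit-learn's BaggingClassifier over randomized trees as a stand-in for a random forest, and it omits the PAC-Bayesian confidence term that turns the empirical out-of-bag estimate into a high-probability bound on the true risk.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def gibbs_oob_majority_bound(X, y, n_trees=100, seed=0):
    """Empirical first-order bound: majority-vote error <= 2 * Gibbs error,
    with the Gibbs error estimated from out-of-bag samples."""
    # Bagged randomized trees as a stand-in for a random forest.
    bag = BaggingClassifier(DecisionTreeClassifier(max_features="sqrt"),
                            n_estimators=n_trees, bootstrap=True,
                            random_state=seed).fit(X, y)
    all_idx = np.arange(len(X))
    errs = []
    for tree, in_bag in zip(bag.estimators_, bag.estimators_samples_):
        oob = np.setdiff1d(all_idx, in_bag)   # samples the tree never saw
        if len(oob):
            errs.append(np.mean(tree.predict(X[oob]) != y[oob]))
    gibbs_err = np.mean(errs)                 # empirical Gibbs risk
    return 2.0 * gibbs_err                    # factor-of-two vote bound
```

As the abstract notes, this factor-of-two route ignores error cancellation across trees, which is exactly what the C-bound approach tries, at the cost of estimating correlations, to capture.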
Exploring helical dynamos with machine learning
We use ensemble machine learning algorithms to study the evolution of
magnetic fields in magnetohydrodynamic (MHD) turbulence that is helically
forced. We perform direct numerical simulations of helically forced turbulence
using mean field formalism, with electromotive force (EMF) modeled both as a
linear and non-linear function of the mean magnetic field and current density.
The form of the EMF is determined using regularized linear regression and
random forests. We also compare various analytical models to the data using
Bayesian inference with Markov Chain Monte Carlo (MCMC) sampling. Our results
demonstrate that linear regression is largely successful at predicting the EMF,
and the use of more sophisticated algorithms (random forests, MCMC) does not
lead to significant improvement in the fits. We conclude that the data we are
looking at are effectively low-dimensional and essentially linear. Finally, to
encourage further exploration by the community, we provide all of our
simulation data and analysis scripts as open-source IPython notebooks.
Comment: accepted by A&A, 11 pages, 6 figures, 3 tables, data + IPython notebooks: https://github.com/fnauman/ML_alpha
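As a hedged illustration of the regression comparison, the sketch below fits the EMF as both a regularized linear and a random-forest function of the mean field and current density. The data here are synthetic placeholders (the actual simulation data live in the linked notebooks), and the toy linear EMF merely mimics the reported outcome.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy stand-ins for time series of the mean field, current density and EMF.
rng = np.random.default_rng(0)
B = rng.normal(size=1000)                        # mean magnetic field
J = rng.normal(size=1000)                        # mean current density
emf = 0.3 * B - 0.1 * J + 0.01 * rng.normal(size=1000)

X = np.column_stack([B, J])
X_tr, X_te, y_tr, y_te = train_test_split(X, emf, random_state=0)

# Regularized linear fit: EMF ~ alpha * B - eta * J.
linear = Ridge(alpha=1.0).fit(X_tr, y_tr)
# Nonlinear alternative for comparison.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("ridge  R^2:", linear.score(X_te, y_te))
print("forest R^2:", forest.score(X_te, y_te))
```

When the underlying relation really is linear, the two scores come out comparable, which is the paper's diagnostic for concluding the data are effectively low-dimensional.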
Review and Comparison of Intelligent Optimization Modelling Techniques for Energy Forecasting and Condition-Based Maintenance in PV Plants
Within the field of soft computing, intelligent optimization modelling techniques include
various major techniques in artificial intelligence. These techniques aim to generate new business
knowledge by transforming sets of "raw data" into business value. One of the principal applications of
these techniques is related to the design of predictive analytics for the improvement of advanced
CBM (condition-based maintenance) strategies and energy production forecasting. These advanced
techniques can be used to transform control system data, operational data and maintenance event data
to failure diagnostic and prognostic knowledge and, ultimately, to derive expected energy generation.
One of the systems where these techniques can be applied with massive potential impact is the
legacy monitoring systems existing in solar PV energy generation plants. These systems produce a
great amount of data over time, while at the same time they demand an important effort in order to
increase their performance through the use of more accurate predictive analytics to reduce production
losses that have a direct impact on ROI. How to choose the most suitable techniques to apply is one of
the problems to address. This paper presents a review and a comparative analysis of six intelligent
optimization modelling techniques, which have been applied on a PV plant case study, using the
energy production forecast as the decision variable. The proposed methodology not only aims
to identify the most accurate solution but also validates the results against the different
outputs of the different techniques.
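A minimal comparison harness in the spirit of this methodology might look as follows; the models, features, and synthetic data are assumptions for illustration, not the study's actual pipeline or its six specific techniques.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic placeholders for plant monitoring data: irradiance, module
# temperature and hour of day as features, energy output as the target.
rng = np.random.default_rng(1)
n = 2000
irradiance = rng.uniform(0, 1000, n)
temperature = rng.uniform(5, 45, n)
hour = rng.integers(0, 24, n)
X = np.column_stack([irradiance, temperature, hour])
y = 0.8 * irradiance - 2.0 * temperature + rng.normal(0, 20, n)

models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "mlp": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(64,),
                                      max_iter=2000, random_state=0)),
    "svr": make_pipeline(StandardScaler(), SVR()),
}
cv = TimeSeriesSplit(n_splits=5)  # respect the temporal order of plant data
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.1f}")
```

Cross-validated error on a common decision variable is what lets the forecast accuracy of otherwise dissimilar techniques be compared on equal footing.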