Context-dependent feature analysis with random forests
In many cases, feature selection is more complicated than identifying a
single subset of input variables that together explain the output. There
may be interactions that depend on contextual information, i.e., variables that
turn out to be relevant only in some specific circumstances. In this setting, the
contribution of this paper is to extend the random forest variable importances
framework in order (i) to identify variables whose relevance is
context-dependent and (ii) to characterize as precisely as possible the effect
of contextual information on these variables. The usage and relevance of
our framework for highlighting context-dependent variables are illustrated on
both artificial and real datasets.
Comment: Accepted for presentation at UAI 201
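The notion of context-dependent relevance can be illustrated with a toy sketch (a simplification, not the paper's extended importance framework): fit a separate forest on each context subgroup and compare the resulting importance scores. The data-generating process and all variable names below are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 2000
context = rng.randint(0, 2, n)          # binary contextual variable
X = rng.normal(size=(n, 3))
# x0 drives the output only when context == 1; x1 always matters
y = X[:, 1] + context * X[:, 0] + 0.1 * rng.normal(size=n)

importances = {}
for c in (0, 1):
    mask = context == c
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X[mask], y[mask])
    importances[c] = forest.feature_importances_

# x0 should look irrelevant in context 0 but important in context 1
print(importances[0].round(2), importances[1].round(2))
```

A per-context refit like this only hints at the phenomenon; the paper's contribution is to detect and characterize such variables within a single importance framework.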
Understanding Random Forests: From Theory to Practice
Data analysis and machine learning have become an integral part of the
modern scientific methodology, offering automated procedures for the prediction
of a phenomenon based on past observations, unraveling underlying patterns in
data and providing insights about the problem. Yet one should be careful not to
use machine learning as a black-box tool, but rather to consider it as a
methodology, with a rational thought process that is entirely dependent on the
problem under study. In particular, the use of algorithms should ideally be
accompanied by a reasonable understanding of their mechanisms, properties and
limitations, in order to better apprehend and interpret their results.
Accordingly, the goal of this thesis is to provide an in-depth analysis of
random forests, consistently calling into question each and every part of the
algorithm, in order to shed new light on its learning capabilities, inner
workings and interpretability. The first part of this work studies the
induction of decision trees and the construction of ensembles of randomized
trees, motivating their design and purpose whenever possible. Our contributions
follow with an original complexity analysis of random forests, showing their
good computational performance and scalability, along with an in-depth
discussion of their implementation details, as contributed within Scikit-Learn.
In the second part of this work, we analyse and discuss the interpretability
of random forests through the lens of variable importance measures. The core of
our contributions rests in the theoretical characterization of the Mean Decrease
of Impurity variable importance measure, from which we prove and derive some of
its properties in the case of multiway totally randomized trees and in
asymptotic conditions. As a consequence of this work, our analysis demonstrates
that variable importances [...].
Comment: PhD thesis. Source code available at
https://github.com/glouppe/phd-thesi
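In Scikit-Learn, the Mean Decrease of Impurity importance analysed in the thesis is exposed as `feature_importances_`. A minimal sketch on a toy XOR problem, approximating totally randomized trees with `ExtraTreesClassifier` at `max_features=1` (an approximation; the thesis studies multiway totally randomized trees in asymptotic conditions):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
n = 2000
X = rng.randint(0, 2, size=(n, 3))       # three binary inputs
y = X[:, 0] ^ X[:, 1]                    # XOR of x0 and x1; x2 is irrelevant

# Randomized splits (max_features=1), trees grown fully;
# feature_importances_ is the Mean Decrease of Impurity, normalized to sum to 1.
forest = ExtraTreesClassifier(n_estimators=300, max_features=1, random_state=0)
forest.fit(X, y)
mdi = forest.feature_importances_
print(mdi.round(3))
```

On this XOR target, x0 and x1 share the importance roughly equally while the irrelevant x2 receives almost none, consistent with the asymptotic results the thesis proves.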
Exploring helical dynamos with machine learning
We use ensemble machine learning algorithms to study the evolution of
magnetic fields in magnetohydrodynamic (MHD) turbulence that is helically
forced. We perform direct numerical simulations of helically forced turbulence
using mean field formalism, with electromotive force (EMF) modeled both as a
linear and non-linear function of the mean magnetic field and current density.
The form of the EMF is determined using regularized linear regression and
random forests. We also compare various analytical models to the data using
Bayesian inference with Markov Chain Monte Carlo (MCMC) sampling. Our results
demonstrate that linear regression is largely successful at predicting the EMF,
and that the use of more sophisticated algorithms (random forests, MCMC) does
not lead to significant improvement in the fits. We conclude that the data we
are looking at is effectively low-dimensional and essentially linear. Finally,
to encourage further exploration by the community, we provide all of our
simulation data and analysis scripts as open-source IPython notebooks.
Comment: accepted by A&A, 11 pages, 6 figures, 3 tables, data + IPython
notebooks: https://github.com/fnauman/ML_alpha
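The linear ansatz for the EMF can be sketched with regularized linear regression. Everything below is a mock stand-in: one-dimensional surrogates for the mean magnetic field and current density with invented coefficients, not the simulation time series analysed in the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(1)
n = 5000
B = rng.normal(size=n)          # mock mean magnetic field
J = rng.normal(size=n)          # mock mean current density
alpha_true, eta_true = 0.8, -0.3  # hypothetical transport coefficients
emf = alpha_true * B + eta_true * J + 0.05 * rng.normal(size=n)

# Linear ansatz  EMF = alpha * B + eta * J, fit by ridge regression
model = Ridge(alpha=1e-3).fit(np.column_stack([B, J]), emf)
alpha_fit, eta_fit = model.coef_
print(alpha_fit, eta_fit)
```

When the underlying relation really is linear and low-dimensional, as the paper concludes for its data, a regularized linear fit like this recovers the coefficients and leaves little room for improvement by more flexible models.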
Sparsity Oriented Importance Learning for High-dimensional Linear Regression
With model selection uncertainty now widely recognized as non-negligible, data
analysts should no longer be satisfied with the output of a single final model
from a model selection process, regardless of its sophistication. To improve
reliability and reproducibility in model choice, one constructive approach is
to make good use of a sound variable importance measure. Although interesting
importance measures are available and increasingly used in data analysis,
little theoretical justification has been provided. In this paper, we propose a new
variable importance measure, sparsity oriented importance learning (SOIL), for
high-dimensional regression from a sparse linear modeling perspective by taking
into account the variable selection uncertainty via the use of a sensible model
weighting. The SOIL method is theoretically shown to have the
inclusion/exclusion property: When the model weights are properly around the
true model, the SOIL importance can well separate the variables in the true
model from the rest. In particular, even if the signal is weak, SOIL rarely
gives variables not in the true model significantly higher importance values
than those in the true model. Extensive simulations in several illustrative
settings and real data examples with guided simulations show desirable
properties of the SOIL importance in contrast to other importance measures.
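The model-weighting idea can be sketched as follows. This is a simplification, not the SOIL procedure itself: it enumerates all subsets of a small linear model, weights them by BIC, and scores each variable by the total weight of the models that include it.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(size=n)   # true model {x0, x2}

# Enumerate candidate subsets and compute each model's BIC
models, bics = [], []
for k in range(1, p + 1):
    for subset in combinations(range(p), k):
        Xs = X[:, subset]
        resid = y - LinearRegression().fit(Xs, y).predict(Xs)
        rss = np.sum(resid ** 2)
        bics.append(n * np.log(rss / n) + (k + 1) * np.log(n))
        models.append(subset)

# BIC-based model weights; variable score = total weight of models containing it
bics = np.array(bics)
w = np.exp(-(bics - bics.min()) / 2)
w /= w.sum()
score = np.array([w[[j in m for m in models]].sum() for j in range(p)])
print(score.round(3))   # near 1 for x0, x2; near 0 for x1, x3
```

With weights concentrated around the true model, the score separates true-model variables from the rest, which is the inclusion/exclusion behaviour the paper establishes theoretically for SOIL's sensible weighting.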
From global to local MDI variable importances for random forests and when they are Shapley values
Random forests have been widely used for their ability to provide so-called importance measures, which give insight at a global (per-dataset) level on the relevance of input variables for predicting a certain output. On the other hand, methods based on Shapley values have been introduced to refine the analysis of feature relevance in tree-based models to a local (per-instance) level. In this context, we first show that the global Mean Decrease of Impurity (MDI) variable importance scores correspond to Shapley values under some conditions. Then, we derive a local MDI importance measure of variable relevance, which has a very natural connection with the global MDI measure and can be related to a new notion of local feature relevance. We further link local MDI importances with Shapley values and discuss them in the light of related measures from the literature. The measures are illustrated through experiments on several classification and regression problems.
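To give a feel for what a per-instance, MDI-style attribution could look like, here is a simplified sketch (not the paper's exact local MDI measure): credit to each feature the impurity decreases of the splits a given instance actually traverses, averaged over the trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def local_attribution(forest, x):
    """Per-instance scores: each split on the instance's root-to-leaf path
    credits its impurity decrease to the split feature (a simplification)."""
    scores = np.zeros(x.shape[0])
    for est in forest.estimators_:
        tree = est.tree_
        node = 0
        while tree.children_left[node] != -1:       # until a leaf is reached
            left = tree.children_left[node]
            right = tree.children_right[node]
            decrease = tree.impurity[node] - (
                tree.weighted_n_node_samples[left] * tree.impurity[left]
                + tree.weighted_n_node_samples[right] * tree.impurity[right]
            ) / tree.weighted_n_node_samples[node]
            scores[tree.feature[node]] += decrease
            node = left if x[tree.feature[node]] <= tree.threshold[node] else right
    return scores / len(forest.estimators_)

scores = local_attribution(forest, X[0])
print(scores.round(3))
```

Unlike the global `feature_importances_`, which averages over all samples, a path-based attribution like this differs from instance to instance; the paper's contribution is a principled local measure whose average recovers the global MDI and which connects to Shapley values.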
Modeling Customer Engagement with churn and upgrade prediction
Modeling customer engagement assists a business in identifying the high risk and high potential
customers. A way to define high risk and high potential customers in a Software-as-a-Service (SaaS)
business is to define them as customers with high potential to churn or upgrade. Identifying the
high risk and high potential customers in time can help the business retain and grow revenue.
This thesis uses churn and upgrade prediction classifiers to define a customer engagement score for
a SaaS business. The classifiers used and compared in the research were logistic regression, random
forest and XGBoost. The classifiers were trained using data from the case company, containing
customer data such as user count and feature usage. To tackle class imbalance, the models were
also trained with oversampled training data. The hyperparameters of each classifier were optimised
using grid search. After training the models, the performance of the classifiers on test data was
evaluated.
In the end, the XGBoost classifier outperformed the other classifiers in churn prediction. In predicting customer upgrades, the results were more mixed. Feature importances were also calculated,
and the results showed that the importances differ between churn and upgrade prediction.
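A compressed sketch of such a churn-prediction pipeline on mock data. Logistic regression stands in for the three classifiers compared in the thesis, the oversampling is naive random duplication, and the features and grid values are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

# Mock customer data: imbalanced binary churn label (~10% churners)
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Naive random oversampling of the minority class in the training set
minority = np.where(y_tr == 1)[0]
extra = np.random.RandomState(0).choice(
    minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

# Hyperparameter optimisation by grid search, evaluation on held-out test data
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1.0, 10.0]}, scoring="roc_auc", cv=5)
grid.fit(X_bal, y_bal)
auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
```

The same skeleton (resample, grid-search, evaluate on untouched test data) applies unchanged when the estimator is swapped for a random forest or XGBoost, as in the thesis.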
SPLIT DECISIONS: PRACTICAL MACHINE LEARNING FOR EMPIRICAL LEGAL SCHOLARSHIP
Multivariable regression may be the most prevalent and useful
task in social science. Empirical legal studies rely heavily on the
ordinary least squares method. Conventional regression methods
have attained credibility in court, but by no means do they dictate
legal outcomes. Using the iconic Boston housing study as a source of
price data, this Article introduces machine-learning regression
methods. Although decision trees and forest ensembles lack the overt
interpretability of linear regression, these methods reduce the opacity
of black-box techniques by scoring the relative importance of dataset
features. This Article will also address the theoretical tradeoff
between bias and variance, as well as the importance of training,
cross-validation, and reserving a holdout dataset for testing.
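The workflow the Article describes (training, cross-validation, and a reserved holdout set) can be sketched on synthetic housing-style data; the original Boston dataset is not used here, and all values below are invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Synthetic price-like data: a few predictors, mildly nonlinear response
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(0, 2, 400)

# Reserve a holdout set for final testing; tune only on the training set
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validation on the training data to compare model families
cv_linear = cross_val_score(LinearRegression(), X_tr, y_tr, cv=5).mean()
cv_tree = cross_val_score(DecisionTreeRegressor(random_state=0),
                          X_tr, y_tr, cv=5).mean()

# An unpruned tree fits the training data almost perfectly (low bias)
# but generalises worse than its training score suggests (high variance)
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
print(f"tree train R2: {tree.score(X_tr, y_tr):.3f}, "
      f"holdout R2: {tree.score(X_ho, y_ho):.3f}, "
      f"tree CV R2: {cv_tree:.3f}, linear CV R2: {cv_linear:.3f}")
```

The gap between the tree's training score and its holdout score is the bias-variance tradeoff the Article discusses, and it is exactly why a holdout set must stay untouched until the final evaluation.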