Training Big Random Forests with Little Resources
Without access to large compute clusters, building random forests on large
datasets remains a challenging problem. This is, in particular, the case if
fully-grown trees are desired. We propose a simple yet effective framework that
allows ensembles of huge trees to be constructed efficiently for hundreds of
millions or even billions of training instances using a cheap desktop computer
with commodity hardware. The basic idea is a multi-level construction scheme,
which builds top trees for small random subsets of the available data and
which subsequently distributes all training instances to the top trees' leaves
for further processing. While conceptually simple, the overall efficiency
crucially depends on the particular implementation of the different phases.
The practical merits of our approach are demonstrated using dense datasets
with hundreds of millions of training instances.

Comment: 9 pages, 9 figures
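A minimal sketch of the multi-level construction scheme described in this abstract, assuming scikit-learn; the function names and the in-memory routing are illustrative simplifications of the paper's out-of-core implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_tree_via_top_tree(X, y, subset_size=100_000, top_depth=8, seed=0):
    """Grow one 'big' tree as a shallow top tree plus fully-grown bottom trees."""
    rng = np.random.default_rng(seed)

    # Phase 1: fit a shallow top tree on a small random subset of the data.
    idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
    top = DecisionTreeClassifier(max_depth=top_depth, random_state=seed)
    top.fit(X[idx], y[idx])

    # Phase 2: route ALL training instances to the top tree's leaves.
    leaf_ids = top.apply(X)

    # Phase 3: grow an unrestricted (fully-grown) bottom tree per leaf partition.
    bottom = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        if len(np.unique(y[mask])) > 1:  # pure partitions need no subtree
            bottom[leaf] = DecisionTreeClassifier(random_state=seed).fit(X[mask], y[mask])
    return top, bottom

def predict(top, bottom, X):
    leaf_ids = top.apply(X)
    out = top.predict(X)                 # fallback: top-tree prediction
    for leaf, sub in bottom.items():
        mask = leaf_ids == leaf
        if mask.any():
            out[mask] = sub.predict(X[mask])
    return out
```

Repeating this with a different random subset per ensemble member yields the forest; only one leaf partition needs to be in memory at a time, which is what makes the scheme attractive on commodity hardware.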
A general guide to applying machine learning to computer architecture
The resurgence of machine learning since the late 1990s has been enabled by significant advances in computing performance and the growth of big data. The ability of these algorithms to detect complex patterns in data that would be extremely difficult to identify manually helps to produce effective predictive models. While computer architects have been accelerating the performance of machine learning algorithms with GPUs and custom hardware, there have been few implementations leveraging these algorithms to improve computer system performance. The work that has been conducted, however, has produced highly promising results.
The purpose of this paper is to serve as a foundational base and guide for future computer architecture research seeking to make use of machine learning models to improve system efficiency. We describe a method that highlights when, why, and how to utilize machine learning models to improve system performance, and provide a relevant example showcasing the effectiveness of applying machine learning in computer architecture. We describe a process of data generation at every execution quantum, together with parameter engineering. This is followed by a survey of a set of popular machine learning models. We discuss their strengths and weaknesses and provide an evaluation of implementations for the purpose of creating a workload performance predictor for different core types in an x86 processor. The predictions can then be exploited by a scheduler for heterogeneous processors to improve system throughput. The algorithms of focus are stochastic-gradient-descent-based linear regression, decision trees, random forests, artificial neural networks, and k-nearest neighbors.

This work has been supported by the European Research Council (ERC) Advanced Grant RoMoL (Grant Agreement 321253) and by the Spanish Ministry of Science and Innovation (contract TIN 2015-65316P).
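As a concrete illustration of such a predictor, here is a minimal sketch assuming scikit-learn; the counter features and the synthetic IPC target are hypothetical stand-ins for the paper's per-quantum measurements:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
# Per-quantum hardware-counter features (hypothetical): e.g. cache miss rate,
# branch mispredict rate, instruction-mix ratio.
X = rng.random((n, 3))
# Target: IPC of the workload on the *other* core type (synthetic here).
y = 2.0 - 1.5 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out quanta:", model.score(X_te, y_te))

# A heterogeneous scheduler could compare the predicted IPC per core type
# and migrate a thread to the core where its predicted throughput is higher.
```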
Understanding Random Forests: From Theory to Practice
Data analysis and machine learning have become an integral part of the modern
scientific methodology, offering automated procedures for the prediction of a
phenomenon based on past observations, unraveling underlying patterns in data
and providing insights about the problem. Yet, care should be taken to avoid
using machine learning as a black-box tool; it should rather be considered as
a methodology, with a rational thought process that is entirely dependent on
the problem under study. In particular, the use of algorithms should ideally
require a reasonable understanding of their mechanisms, properties and
limitations, in order to better apprehend and interpret their results.
Accordingly, the goal of this thesis is to provide an in-depth analysis of
random forests, consistently calling into question each and every part of the
algorithm, in order to shed new light on its learning capabilities, inner
workings and interpretability. The first part of this work studies the
induction of decision trees and the construction of ensembles of randomized
trees, motivating their design and purpose whenever possible. Our contributions
follow with an original complexity analysis of random forests, showing their
good computational performance and scalability, along with an in-depth
discussion of their implementation details, as contributed within Scikit-Learn.
In the second part of this work, we analyse and discuss the interpretability
of random forests through the lens of variable importance measures. The core
of our contributions rests in the theoretical characterization of the Mean
Decrease of Impurity variable importance measure, from which we prove and
derive some of its properties in the case of multiway totally randomized trees
and in asymptotic conditions. In consequence of this work, our analysis
demonstrates that variable importances [...].

Comment: PhD thesis. Source code available at
https://github.com/glouppe/phd-thesi
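For reference, the Mean Decrease of Impurity measure analysed in the thesis is conventionally defined as follows (a standard formulation, not quoted from the thesis itself):

$$
\operatorname{Imp}(X_j) \;=\; \frac{1}{N_T} \sum_{T} \;\sum_{\substack{t \in T \\ v(s_t) = X_j}} p(t)\, \Delta i(s_t, t),
$$

where the outer sum runs over the $N_T$ trees in the forest, the inner sum runs over the internal nodes $t$ of tree $T$ whose split $s_t$ uses variable $X_j$, $p(t)$ is the proportion of samples reaching node $t$, and $\Delta i(s_t, t)$ is the impurity decrease induced by the split.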
Ensemble of Example-Dependent Cost-Sensitive Decision Trees
Several real-world classification problems are example-dependent
cost-sensitive in nature, where the costs due to misclassification vary
between examples and not only between classes. However, standard
classification methods do not take these costs into account and assume a
constant cost for misclassification errors. In previous works, some methods
that take the financial costs into account during the training of different
algorithms have been proposed, with the example-dependent cost-sensitive
decision tree algorithm being the one that yields the highest savings. In this
paper we propose a new framework of ensembles of example-dependent
cost-sensitive decision trees. The framework consists of creating different
example-dependent cost-sensitive decision trees on random subsamples of the
training set, and then combining them using three different combination
approaches. Moreover, we propose two new cost-sensitive combination
approaches: cost-sensitive weighted voting and cost-sensitive stacking, the
latter being based on the cost-sensitive logistic regression method. Finally,
using five different databases from four real-world applications (credit card
fraud detection, churn modeling, credit scoring and direct marketing), we
evaluate the proposed methods against state-of-the-art example-dependent
cost-sensitive techniques, namely cost-proportionate sampling, Bayes minimum
risk and cost-sensitive decision trees. The results show that the proposed
algorithms achieve better results, in the sense of higher savings, for all
databases.

Comment: 13 pages, 6 figures. Submitted for possible publication.
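A minimal sketch of the cost-sensitive weighted-voting combination, assuming scikit-learn; the example-dependent cost accounting and the savings-based weights are simplified illustrations rather than the paper's exact formulation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def total_cost(y_true, y_pred, cost_fp, cost_fn):
    # Example-dependent costs: each instance carries its own FP/FN cost.
    return np.sum(np.where(y_pred > y_true, cost_fp,
                  np.where(y_pred < y_true, cost_fn, 0.0)))

def fit_cost_weighted_ensemble(X, y, cost_fp, cost_fn, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    trees, weights = [], []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)  # random subsample
        oob = np.setdiff1d(np.arange(len(X)), idx)                 # held-out part
        t = DecisionTreeClassifier(max_depth=5, random_state=seed).fit(X[idx], y[idx])
        # Weight each tree by the savings it achieves out-of-sample: the cost
        # of the all-negative baseline minus the tree's own cost.
        base = total_cost(y[oob], np.zeros_like(y[oob]), cost_fp[oob], cost_fn[oob])
        cost = total_cost(y[oob], t.predict(X[oob]), cost_fp[oob], cost_fn[oob])
        trees.append(t)
        weights.append(max(base - cost, 1e-9))
    return trees, np.array(weights) / np.sum(weights)

def predict(trees, weights, X):
    votes = np.stack([t.predict(X) for t in trees])  # (n_trees, n_samples)
    return (weights @ votes > 0.5).astype(int)       # cost-weighted majority vote
```

The design point worth noting is that the combination weights come from out-of-sample savings rather than accuracy, so a tree that avoids expensive errors counts for more than one that merely gets many cheap examples right.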
Ensemble learning for electricity consumption forecasting in office buildings
This paper presents three ensemble learning models for short-term load forecasting. Machine learning has evolved quickly in recent years, leading to novel and advanced models that are improving forecasting results in multiple fields. However, in highly dynamic fields such as power and energy systems, dealing with the fast acquisition of large amounts of data from multiple sources and taking advantage of the correlation between the many available variables is a challenging task for which current models are not prepared. Ensemble learning is bringing promising results in this sense as, by combining the results of multiple learners, it is able to find new ways for current learning models to be used and optimized. In this paper three ensemble learning models are developed and their results compared: gradient boosted regression trees, random forests, and an adaptation of AdaBoost. Results for hour-ahead electricity consumption forecasting are presented using a case study based on real data from an office building. The results show that the adapted AdaBoost model outperforms the reference models for hour-ahead load forecasting.

This work has been developed under the SPET project - PTDC/EEI-EEE/29165/2017 and has received funding from UID/EEA/00760/2019, funded by FEDER Funds through COMPETE and by National Funds through FCT.
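A minimal sketch comparing the three ensemble learners on an hour-ahead task, assuming scikit-learn; the stock AdaBoostRegressor stands in for the paper's adaptation, and the synthetic load series stands in for the office-building data:

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
hours = np.arange(24 * 200)  # ~200 days of hourly consumption
load = 50 + 20 * np.sin(2 * np.pi * hours / 24) + 5 * rng.standard_normal(len(hours))

# Features: the previous 24 hourly consumptions; target: the next hour.
lags = 24
X = np.stack([load[i:i + lags] for i in range(len(load) - lags)])
y = load[lags:]
split = int(0.8 * len(X))  # chronological train/test split, no shuffling

models = {
    "gradient boosted regression trees": GradientBoostingRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "AdaBoost (stock, not the paper's adaptation)": AdaBoostRegressor(random_state=0),
}
for name, m in models.items():
    m.fit(X[:split], y[:split])
    print(name, "MAE:", mean_absolute_error(y[split:], m.predict(X[split:])))
```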