Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression
Random Forests (Breiman, 2001) is a successful and widely used regression and
classification algorithm. Part of its appeal and reason for its versatility is
its (implicit) construction of a kernel-type weighting function on training
data, which can also be used for targets other than the original mean
estimation. We propose a novel forest construction for multivariate responses
based on their joint conditional distribution, independent of the estimation
target and the data model. It uses a new splitting criterion based on the MMD
distributional metric, which is suitable for detecting heterogeneity in
multivariate distributions. The induced weights define an estimate of the full
conditional distribution, which in turn can be used for arbitrary and
potentially complicated targets of interest. The method is very versatile and
convenient to use, as we illustrate on a wide range of examples. The code is
available as the Python and R packages drf.
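To make the MMD-based splitting idea concrete, the following is a minimal sketch of a biased squared-MMD estimate between the multivariate responses that would fall into two candidate child nodes. The Gaussian kernel, the bandwidth `sigma`, the function names, and the toy data are all assumptions made for illustration; this is not the API of the drf packages, and the abstract does not state how the criterion is scaled or approximated in practice.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel between two response vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(left, right, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(left, left, sigma)
            + mean_kernel(right, right, sigma)
            - 2.0 * mean_kernel(left, right, sigma))

# Toy bivariate responses (hypothetical data). A candidate split is more
# attractive when the two child nodes have more different distributions.
left = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)]
similar = [(0.1, 0.0), (-0.2, 0.1), (0.0, -0.1)]
shifted = [(3.0, 3.1), (2.9, 3.0), (3.2, 2.8)]

score_similar = mmd_squared(left, similar)
score_shifted = mmd_squared(left, shifted)
```

In this sketch a split that separates `left` from `shifted` scores higher than one that separates `left` from `similar`, which is the kind of distributional heterogeneity the criterion is meant to detect.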
A branch & bound algorithm to determine optimal bivariate splits for oblique decision tree induction
Univariate decision tree induction methods for multiclass classification problems, such as CART, C4.5 and ID3, continue to be very popular in machine learning because they are easy to interpret. However, as these trees consider only a single attribute per node, they often become quite large, which lowers their explanatory value. Oblique decision tree building algorithms, which divide the feature space by multidimensional hyperplanes, often produce much smaller trees, but the individual splits are hard to interpret. Moreover, the effort of finding optimal oblique splits is so high that heuristics have to be applied to determine locally optimal solutions. In this work, we introduce an effective branch and bound procedure to determine globally optimal bivariate oblique splits for concave impurity measures. Decision trees based on these bivariate oblique splits remain fairly interpretable due to the restriction to two attributes per split. The resulting trees are significantly smaller and more accurate than their univariate counterparts because they adapt better to the underlying data and capture interactions of attribute pairs. Moreover, our evaluation shows that our algorithm even outperforms algorithms based on heuristically obtained multivariate oblique splits, despite the fact that we focus on two attributes only.
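As a small illustration of what a bivariate oblique split is, the sketch below evaluates the weighted Gini impurity of a split of the form `a*x1 + b*x2 <= c` and compares it with the best axis-parallel split on the same toy data. It only scores a given split; it does not implement the branch and bound search described in the abstract. All function names and the data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(parts):
    """Size-weighted average impurity of the parts of a partition."""
    total = sum(len(p) for p in parts)
    return sum(len(p) / total * gini(p) for p in parts)

def bivariate_split_impurity(points, labels, a, b, c):
    """Impurity of the oblique split a*x1 + b*x2 <= c."""
    left = [y for (x1, x2), y in zip(points, labels) if a * x1 + b * x2 <= c]
    right = [y for (x1, x2), y in zip(points, labels) if a * x1 + b * x2 > c]
    return weighted_gini([left, right])

def best_univariate_impurity(points, labels):
    """Best impurity over all axis-parallel splits x_j <= t."""
    best = 1.0
    for j in range(2):
        for t in sorted({p[j] for p in points}):
            left = [y for p, y in zip(points, labels) if p[j] <= t]
            right = [y for p, y in zip(points, labels) if p[j] > t]
            if left and right:
                best = min(best, weighted_gini([left, right]))
    return best

# Toy data: the classes are separated by the diagonal line x1 + x2 = 3.
points = [(0, 0), (1, 1), (0, 2), (2, 0), (3, 1), (1, 3), (2, 2), (3, 3)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

oblique = bivariate_split_impurity(points, labels, 1.0, 1.0, 3.0)
univariate = best_univariate_impurity(points, labels)
```

On this data the single line `x1 + x2 <= 3` separates the classes perfectly, while no single axis-parallel split does, which is why a tree built from bivariate splits can be smaller.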
Optimization algorithms for decision tree induction
Decision trees are among the most commonly used machine learning models for solving
classification and regression tasks due to their major advantage of being easy to interpret.
However, their predictions are often not as accurate as those of other models.
The most widely used approach for learning decision trees is to build them in a top-down manner by introducing splits on a single variable that minimize a certain splitting
criterion. One possibility of improving this strategy to induce smaller and more accurate
decision trees is to allow different types of splits which, for example, consider multiple
features simultaneously. However, finding such splits is usually much more complex and
effective optimization methods are needed to determine optimal solutions.
An alternative to univariate splits for numerical features are oblique splits, which
employ affine hyperplanes to divide the feature space. Unfortunately, the problem of
determining such a split optimally is known to be NP-hard in general. Inspired by the
underlying problem structure, two new heuristics are developed for finding near-optimal
oblique splits. The first one is a cross-entropy optimization method which iteratively
samples points from the von Mises-Fisher distribution and updates its parameters based
on the best performing samples. The second one is a simulated annealing algorithm that
uses a pivoting strategy to explore the solution space.
As general oblique splits employ all of the numerical features simultaneously, they are
hard to interpret. As an alternative, in this thesis, the usage of bivariate oblique splits
is proposed. These splits correspond to lines in the subspace spanned by two features.
They are capable of dividing the feature space much more efficiently than univariate
splits while also being fairly interpretable due to the restriction to two features only.
A branch and bound method is presented to determine these bivariate oblique splits
optimally.
Furthermore, a branch and bound method to determine optimal cross-splits is presented. These splits can be viewed as combinations of two standard univariate splits
on numeric attributes and they are useful in situations where the data points cannot
be separated well linearly. The cross-splits can either be introduced directly to induce
quaternary decision trees or, which is usually better, they can be used to provide a
certain degree of foresight, in which case only the better of the two respective univariate
splits is introduced.
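The idea of a cross-split can be sketched as follows: two thresholds, one per numeric feature, partition the data into up to four cells whose weighted impurity is scored. The search here is a plain exhaustive enumeration over observed values, used only for illustration in place of the branch and bound method the thesis presents; the function names and the XOR-like data are assumptions.

```python
from collections import Counter
from itertools import product

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(parts):
    """Size-weighted average impurity of the parts of a partition."""
    total = sum(len(p) for p in parts)
    return sum(len(p) / total * gini(p) for p in parts)

def cross_split_impurity(points, labels, t1, t2):
    """Impurity of the four cells induced by x1 <= t1 and x2 <= t2."""
    cells = {}
    for (x1, x2), y in zip(points, labels):
        cells.setdefault((x1 <= t1, x2 <= t2), []).append(y)
    return weighted_gini(list(cells.values()))

def best_cross_split(points, labels):
    """Exhaustive search over observed thresholds (not branch and bound)."""
    t1s = sorted({p[0] for p in points})
    t2s = sorted({p[1] for p in points})
    return min(
        (cross_split_impurity(points, labels, t1, t2), t1, t2)
        for t1, t2 in product(t1s, t2s)
    )

# XOR-like toy data: no single linear split separates the classes well.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (0.2, 0.1), (0.1, 0.9), (0.9, 0.2), (0.8, 0.8)]
labels = ["a", "b", "b", "a", "a", "b", "b", "a"]

best = best_cross_split(points, labels)
# With t2 at the maximum of x2, the cross-split degenerates to the
# univariate split x1 <= 0.2.
single = cross_split_impurity(points, labels, 0.2, 1)
```

Here the cross-split at thresholds (0.2, 0.2) yields four pure cells, whereas the univariate split `x1 <= 0.2` on its own leaves both children evenly mixed. In the foresight variant described above, only the better of the two univariate splits forming the cross would actually be introduced.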
The developed lower bounds for impurity-based splitting criteria also motivate a
simple but effective branch and bound algorithm for splits on nominal features. Due to
the complexity of determining such splits optimally when the number of possible values
for the feature is large, one previously had to use encoding schemes to transform the
nominal features into numerical ones or rely on heuristics to find near-optimal nominal
splits. The proposed branch and bound method may be a viable alternative for many
practical applications.
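To illustrate why optimal nominal splits become expensive, the sketch below enumerates every binary partition of the values of a nominal feature and scores each with the weighted Gini impurity. For k distinct values there are 2^(k-1) - 1 candidate partitions, which is the growth that a bounding scheme aims to prune. This is a brute-force baseline under assumed names and toy data, not the branch and bound algorithm from the thesis.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(parts):
    """Size-weighted average impurity of the parts of a partition."""
    total = sum(len(p) for p in parts)
    return sum(len(p) / total * gini(p) for p in parts)

def best_nominal_split(values, labels):
    """Try every binary partition of the categories; return best and count."""
    categories = sorted(set(values))
    # Fixing one category on the left avoids counting mirrored partitions.
    anchor, rest = categories[0], categories[1:]
    best = None
    n_candidates = 0
    # r < len(rest) keeps at least one category on the right-hand side.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left_set = {anchor, *extra}
            left = [y for v, y in zip(values, labels) if v in left_set]
            right = [y for v, y in zip(values, labels) if v not in left_set]
            n_candidates += 1
            score = weighted_gini([left, right])
            if best is None or score < best[0]:
                best = (score, left_set)
    return best, n_candidates

# Hypothetical nominal feature with four values and two classes.
values = ["red", "green", "blue", "yellow", "red", "green", "blue", "yellow"]
labels = ["a", "b", "a", "b", "a", "b", "a", "b"]

best, n_candidates = best_nominal_split(values, labels)
```

With four values only 7 partitions need to be checked, but the count roughly doubles with each additional value, so exhaustive enumeration quickly becomes impractical for features with many values.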
Lastly, a genetic algorithm is proposed as an alternative to the top-down induction
strategy.