636 research outputs found
Towards Enhanced Local Explainability of Random Forests: a Proximity-Based Approach
We initiate a novel approach to explain the out-of-sample performance of
random forest (RF) models by exploiting the fact that any RF can be formulated
as an adaptive weighted K nearest-neighbors model. Specifically, we use the
proximity between points in the feature space learned by the RF to re-write
random forest predictions exactly as a weighted average of the target labels of
training data points. This linearity facilitates a local notion of
explainability of RF predictions that generates attributions for any model
prediction across observations in the training set, and thereby complements
established methods like SHAP, which instead generates attributions for a model
prediction across dimensions of the feature space. We demonstrate this approach
in the context of a bond pricing model trained on US corporate bond trades, and
compare our approach to various existing approaches to model explainability.
Comment: 5 pages, 6 figures
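The weighted-average view described in this abstract can be illustrated on a toy forest. The sketch below is not the authors' code: the two hand-written "trees" (`tree_a`, `tree_b`), the data, and the helper names are hypothetical, and each tree is assumed to be fit on the full training set (no bootstrap), a simple setting in which the identity holds exactly.

```python
# Toy training set: one feature, one target (made-up numbers).
X_train = [0.1, 0.4, 0.6, 0.9]
y_train = [1.0, 2.0, 4.0, 8.0]

# Two hypothetical "trees", each mapping a feature value to a leaf id.
def tree_a(x):
    return 0 if x < 0.5 else 1

def tree_b(x):
    return 0 if x < 0.3 else (1 if x < 0.8 else 2)

TREES = [tree_a, tree_b]

def forest_predict(x):
    """Usual forest prediction: average over trees of the mean target in the query's leaf."""
    total = 0.0
    for tree in TREES:
        leaf = tree(x)
        members = [y for xi, y in zip(X_train, y_train) if tree(xi) == leaf]
        total += sum(members) / len(members)
    return total / len(TREES)

def proximity_weights(x):
    """Weight per training point: averaged over trees, 1/leaf_size when it shares the query's leaf."""
    weights = [0.0] * len(X_train)
    for tree in TREES:
        leaf = tree(x)
        idx = [i for i, xi in enumerate(X_train) if tree(xi) == leaf]
        for i in idx:
            weights[i] += 1.0 / (len(idx) * len(TREES))
    return weights

query = 0.7
w = proximity_weights(query)
# Same prediction, rewritten as a weighted average of training labels.
weighted = sum(wi * yi for wi, yi in zip(w, y_train))
```

Here `w` plays the role of a per-observation attribution: each entry says how much one training point contributes to the prediction at `query`, and the entries sum to one.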
yaImpute: An R Package for kNN Imputation
This article introduces yaImpute, an R package for nearest neighbor search and imputation. Although nearest neighbor imputation is used in a host of disciplines, the methods implemented in the yaImpute package are tailored to imputation-based forest attribute estimation and mapping. The impetus for writing yaImpute is a growing interest in nearest neighbor imputation methods for spatially explicit forest inventory, and a need within this research community for software that facilitates comparison among different nearest neighbor search algorithms and subsequent imputation techniques. yaImpute provides directives for defining the search space, subsequent distance calculation, and imputation rules for a given number of nearest neighbors. Further, the package offers a suite of diagnostics for comparison among results generated from different imputation analyses and a set of functions for mapping imputation results.
Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates
We establish the first nonasymptotic error bounds for Kaplan-Meier-based
nearest neighbor and kernel survival probability estimators where feature
vectors reside in metric spaces. Our bounds imply rates of strong consistency
for these nonparametric estimators and, up to a log factor, match an existing
lower bound for conditional CDF estimation. Our proof strategy also yields
nonasymptotic guarantees for nearest neighbor and kernel variants of the
Nelson-Aalen cumulative hazards estimator. We experimentally compare these
methods on four datasets. We find that for the kernel survival estimator, a
good choice of kernel is one learned using random survival forests.
Comment: International Conference on Machine Learning (ICML 2019)
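As a rough illustration of a Kaplan-Meier-based nearest neighbor survival estimator, the sketch below restricts the standard Kaplan-Meier product to the k training points closest to a query. It is an assumption-laden toy, not the paper's implementation: features are scalars compared by absolute difference, and the function name `knn_kaplan_meier` and the data are made up.

```python
def knn_kaplan_meier(x, data, k):
    """Kaplan-Meier curve computed only from the k nearest neighbours of x.

    data: list of (feature, observed_time, event), with event = 1 if the
    event was observed and 0 if the observation was censored.
    Returns a list of (time, survival_probability) steps.
    """
    # k nearest neighbours under the absolute-difference metric (an assumption).
    neighbours = sorted(data, key=lambda row: abs(row[0] - x))[:k]
    event_times = sorted({t for _, t, e in neighbours if e == 1})

    surv = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for _, ti, _ in neighbours if ti >= t)
        deaths = sum(1 for _, ti, e in neighbours if ti == t and e == 1)
        surv *= 1.0 - deaths / at_risk  # Kaplan-Meier product term
        curve.append((t, surv))
    return curve

# Made-up observations: (feature, time, event indicator).
toy = [(0.0, 2.0, 1), (1.0, 3.0, 0), (2.0, 5.0, 1), (3.0, 4.0, 1), (50.0, 1.0, 1)]
curve = knn_kaplan_meier(1.0, toy, k=4)
```

The far-away point at feature 50.0 is excluded by the neighbour search, so its early event time does not affect the estimated curve at the query.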
Exclusive lasso-based k-nearest-neighbor classification
Conventionally, k-nearest-neighbor (kNN) classification is implemented with Euclidean distance-based measures, which capture mainly one-to-one similarity relationships and therefore lose the connections between different samples. To alleviate this issue, the coefficients produced by sparse representation (SR) have also been used as a similarity measure for nearest-neighbor classification. Although SR coefficients are highly discriminative as a one-to-many relationship, SR carries out variable selection at the individual level, so that possible inherent group structure is ignored. To make the most of the information implied in the group structure, this paper employs the exclusive lasso (EL) strategy to perform the similarity evaluation in two novel nearest-neighbor classification methods. Experimental results on both benchmark data sets and the face recognition problem demonstrate that the EL-based kNN method outperforms certain state-of-the-art classification techniques and existing representation-based nearest-neighbor approaches, in terms of both the size of feature reduction and the classification accuracy.
An Optimal k Nearest Neighbours Ensemble for Classification Based on Extended Neighbourhood Rule with Features subspace
To minimize the effect of outliers, kNN ensembles identify a set of closest
observations to a new sample point to estimate its unknown class by using
majority voting in the labels of the training instances in the neighbourhood.
Ordinary kNN based procedures determine k closest training observations in the
neighbourhood region (enclosed by a sphere) by using a distance formula. The k
nearest neighbours procedure may not work in a situation where sample points in
the test data follow the pattern of the nearest observations that lie on a
certain path not contained in the given sphere of nearest neighbours.
Furthermore, these methods combine hundreds of base kNN learners and many of
them might have high classification errors thereby resulting in poor ensembles.
To overcome these problems, an optimal extended neighbourhood rule based
ensemble is proposed where the neighbours are determined in k steps. It starts
from the first nearest sample point to the unseen observation. The second
nearest data point is identified that is closest to the previously selected
data point. This process is continued until the required number of the k
observations are obtained. Each base model in the ensemble is constructed on a
bootstrap sample in conjunction with a random subset of features. After
building a sufficiently large number of base models, the optimal models are
then selected based on their performance on out-of-bag (OOB) data.
Comment: 12 pages
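The k-step neighbour search described in this abstract can be sketched for a single base learner as follows. This is a minimal illustration under stated assumptions: Euclidean distance, made-up points and labels, and hypothetical function names. It omits the bootstrap samples, random feature subsets, and OOB-based model selection that the abstract also describes.

```python
import math
from collections import Counter

def extended_neighbours(query, points, k):
    """Pick k points step by step: nearest to the query first, then
    repeatedly the unused point nearest to the previously chosen one."""
    remaining = set(range(len(points)))
    chosen = []
    anchor = query
    for _ in range(min(k, len(points))):
        # Ties are broken by index so the result is deterministic.
        nxt = min(remaining, key=lambda i: (math.dist(anchor, points[i]), i))
        chosen.append(nxt)
        remaining.remove(nxt)
        anchor = points[nxt]
    return chosen

def plain_neighbours(query, points, k):
    """Ordinary kNN: the k points closest to the query itself."""
    order = sorted(range(len(points)), key=lambda i: (math.dist(query, points[i]), i))
    return order[:k]

def majority_vote(indices, labels):
    return Counter(labels[i] for i in indices).most_common(1)[0][0]

# Made-up data: class "a" lies along a path leading away from the query.
points = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, -1.5), (0.0, 1.6)]
labels = ["a", "a", "a", "b", "b"]
query = (0.0, 0.0)

chain = extended_neighbours(query, points, k=3)
ball = plain_neighbours(query, points, k=3)
chain_label = majority_vote(chain, labels)
ball_label = majority_vote(ball, labels)
```

In this toy example the chained search follows the path of class "a" points, while the ordinary spherical neighbourhood is dominated by class "b".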
Banzhaf random forests: Cooperative game theory based random forests with consistency.
Random forest algorithms have been widely used in many classification and regression applications. However, the theory of random forests lags far behind their applications. In this paper, we propose a novel random forests classification algorithm based on cooperative game theory. The Banzhaf power index is employed to evaluate the power of each feature by traversing possible feature coalitions. Hence, we call the proposed algorithm Banzhaf random forests (BRFs). Unlike the previously used information gain ratio, which only measures the power of each feature for classification and pays less attention to the intrinsic structure of the feature variables, the Banzhaf power index can measure the importance of each feature by computing the dependency among the group of features. More importantly, we have proved the consistency of BRFs, which narrows the gap between the theory and applications of random forests. Extensive experiments on several UCI benchmark data sets and three real world applications show that BRFs perform significantly better than existing consistent random forests on classification accuracy, and better than or at least comparable with Breiman’s random forests, support vector machines (SVMs) and k-nearest neighbors (KNNs) classifiers.
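For readers unfamiliar with the Banzhaf power index, the sketch below computes it by enumerating every coalition of the other players, in the spirit of "traversing possible feature coalitions". The coalition value function used here (a weighted-majority rule with made-up weights and quota) is purely an assumption for illustration; the abstract does not state how BRFs score a coalition of features.

```python
from itertools import combinations

def banzhaf_values(players, value):
    """Banzhaf value of each player: the average marginal contribution
    value(S + {p}) - value(S) over all coalitions S of the other players."""
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                s = frozenset(coalition)
                total += value(s | {p}) - value(s)
        result[p] = total / 2 ** len(others)
    return result

# Hypothetical coalition value: a coalition "wins" (value 1) when the
# summed weights of its features reach the quota. Weights are made up.
WEIGHTS = {"f1": 3, "f2": 2, "f3": 1}
QUOTA = 4

def toy_value(coalition):
    return 1.0 if sum(WEIGHTS[f] for f in coalition) >= QUOTA else 0.0

power = banzhaf_values(list(WEIGHTS), toy_value)
```

The enumeration visits 2^(n-1) coalitions per player, so this brute-force form is only practical for a small number of features.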
Utilizing Data Mining Techniques and Ensemble Learning to Predict Development of Surgical Site Infections in Gynecologic Cancer Patients
Surgical site infections are costly to both patients and hospitals, increase patient mortality, and are the most common form of hospital-acquired infection. Gynecological cancer surgery patients are already at higher risk of developing an infection due to the suppression of their immune system. This research leverages popular data mining techniques to create a prediction model to identify high-risk patients. Implemented techniques include logistic regression, naive Bayes, recursive partitioning and regression trees, random forest, feed forward neural network, k-nearest neighbor, and support vector machines with linear kernel. Weighted stacked generalization was implemented to improve upon the performance of the individual base-level models. The chosen meta-level classifiers were support vector machines with linear kernel, logistic regression, and k-nearest neighbor. The result is a model that identifies high-risk patients immediately following a surgical procedure with an AUC of 0.6864, accuracy of 0.6744, sensitivity of 0.7, and specificity of 0.6728.
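The four evaluation metrics quoted in this abstract have standard definitions, sketched below on made-up labels and scores. The numbers here are illustrative only and are unrelated to the study's data; the 0.5 decision threshold and the function names are assumptions.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = infection, 0 = no infection.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarises ranking quality across all thresholds.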
Analysis of purely random forests bias
Random forests are a very effective and commonly used statistical method, but
their full theoretical analysis is still an open problem. As a first step,
simplified models such as purely random forests have been introduced, in order
to shed light on the good performance of random forests. In this paper, we
study the approximation error (the bias) of some purely random forest models in
a regression framework, focusing in particular on the influence of the number
of trees in the forest. Under some regularity assumptions on the regression
function, we show that the bias of an infinite forest decreases at a faster
rate (with respect to the size of each tree) than a single tree. As a
consequence, infinite forests attain a strictly better risk rate (with respect
to the sample size) than single trees. Furthermore, our results allow us to derive
a minimum number of trees sufficient to reach the same rate as an infinite
forest. As a by-product of our analysis, we also show a link between the bias
of purely random forests and the bias of some kernel estimators.