A Fuzzy k-Nearest Neighbors Classifier to Deal with Imperfect Data
© 2018. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/. This document is the Accepted version of a Published Work that appeared in final form in Soft Computing. To access the final edited and published work see https://doi.org/10.1007/s00500-017-2567-x

The k-nearest neighbors method (kNN) is a nonparametric, instance-based method used for regression and classification. To classify a new instance, the kNN method computes its k nearest neighbors and generates a class value from them. Usually, this method requires that the information available in the datasets be precise and accurate, except for the existence of missing values. However, data imperfection is inevitable when dealing with real-world scenarios. In this paper, we present the kNNimp classifier, a k-nearest neighbors method to perform classification from datasets with imperfect values. The importance of each neighbor in the output decision is based on its relative distance and its degree of imperfection. Furthermore, by using external parameters, the classifier enables us to define the maximum allowed imperfection, and to decide whether the final output should be derived solely from the greatest-weight class (the best class) or from the best class and a weighted combination of the classes closest to it. To test the proposed method, we performed several experiments with both synthetic and real-world datasets with imperfect data. The results, validated through statistical tests, show that the kNNimp classifier is robust when working with imperfect data and maintains good performance compared with other methods in the literature, applied to datasets with or without imperfection.
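The classical kNN decision rule that the abstract builds on can be sketched as follows. This is a plain majority-vote kNN, not the authors' kNNimp weighting by distance and imperfection degree; the Euclidean metric, the k=3 default, and the function name are illustrative choices:

```python
from collections import Counter
from math import dist

def knn_predict(train, x, k=3):
    """Classify point x by majority vote among its k nearest training points.

    train: list of (point, label) pairs, where each point is a tuple of floats.
    """
    # Sort training points by Euclidean distance to x and keep the k closest.
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated clusters, labels 0 and 1.
train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
         ((5.0, 5.0), 1), ((5.1, 4.9), 1), ((4.9, 5.2), 1)]
print(knn_predict(train, (0.15, 0.1)))  # → 0
print(knn_predict(train, (5.0, 5.1)))   # → 1
```

The kNNimp method replaces the unweighted vote with weights derived from each neighbor's relative distance and imperfection, but the neighbor-search skeleton is the same.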
Statistical Theory for Imbalanced Binary Classification
Within the vast body of statistical theory developed for binary
classification, few meaningful results exist for imbalanced classification, in
which data are dominated by samples from one of the two classes. Existing
theory faces at least two main challenges. First, meaningful results must
consider more complex performance measures than classification accuracy. To
address this, we characterize a novel generalization of the Bayes-optimal
classifier to any performance metric computed from the confusion matrix, and we
use this to show how relative performance guarantees can be obtained in terms
of the error of estimating the class probability function under uniform
(L∞) loss. Second, as we show, optimal classification
performance depends on certain properties of class imbalance that have not
previously been formalized. Specifically, we propose a novel sub-type of class
imbalance, which we call Uniform Class Imbalance. We analyze how Uniform Class
Imbalance influences optimal classifier performance and show that it
necessitates different classifier behavior than other types of class imbalance.
We further illustrate these two contributions in the case of k-nearest
neighbor classification, for which we develop novel guarantees. Together, these
results provide some of the first meaningful finite-sample statistical theory
for imbalanced binary classification.

Comment: Parts of this paper have been revised from arXiv:2004.04715v2 [math.ST]
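A small illustration of why metrics computed from the confusion matrix matter under imbalance, as the abstract argues. The 95:5 class split, the all-negative baseline classifier, and the choice of F1 as the metric are illustrative, not taken from the paper:

```python
def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def f1(y_true, y_pred):
    """F1 score computed directly from the confusion matrix."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 95:5 imbalance; the trivial classifier predicts the majority class everywhere.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)          # 0.95 — looks strong
print(f1(y_true, y_pred))  # 0.0 — exposes total failure on the minority class
```

Accuracy rewards the degenerate majority-class classifier, which is why the paper develops optimality theory for general confusion-matrix metrics rather than accuracy alone.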